Starbucks Capstone Challenge

Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.

Keep in mind as well that someone using the app might make a purchase without having received or seen an offer.

Example

To give an example, a user could receive a "spend 10 dollars, get 2 dollars off" discount offer on Monday, valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view it, and still complete it. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer but never open it during the 10-day validity period, yet spend 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed it.
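This "completed but never viewed" case can be detected directly from the event records. Below is a minimal pandas sketch on a hypothetical two-customer transcript (names and values are made up for illustration):

```python
import pandas as pd

# Toy transcript in the same shape as the real one (hypothetical records).
transcript = pd.DataFrame({
    'person':   ['a', 'a', 'b', 'b', 'b'],
    'event':    ['offer received', 'offer completed',
                 'offer received', 'offer viewed', 'offer completed'],
    'time':     [0, 120, 0, 24, 96],
    'offer_id': ['o1', 'o1', 'o1', 'o1', 'o1'],
})

def influenced(group):
    """For one (person, offer) group, check whether a view preceded completion."""
    completed = group.loc[group.event == 'offer completed', 'time']
    viewed = group.loc[group.event == 'offer viewed', 'time']
    if completed.empty:
        return None          # never completed
    if viewed.empty:
        return False         # completed without ever viewing
    return bool((viewed <= completed.min()).any())

flags = transcript.groupby(['person', 'offer_id']).apply(influenced)
print(flags)
# person 'a' completed without viewing -> False; person 'b' viewed first -> True
```

The real analysis would also need to respect the offer's validity window; this sketch only orders the events.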

Cleaning

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
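Such a baseline can be estimated by grouping no-offer purchases by demographics. A minimal sketch with hypothetical data follows; in practice, tagging each transaction with whether an offer was active at that time is the hard part:

```python
import pandas as pd

# Hypothetical purchases, each tagged with whether any offer was active
# for that customer at purchase time (the tagging itself is the hard part).
purchases = pd.DataFrame({
    'gender':       ['F', 'F', 'M', 'M', 'F', 'M'],
    'age_group':    ['18-35', '36-50', '18-35', '36-50', '18-35', '18-35'],
    'offer_active': [False, False, False, True, True, False],
    'amount':       [12.0, 8.5, 3.0, 15.0, 20.0, 4.0],
})

# Baseline: average spend per demographic group when no offer was active.
baseline = (purchases[~purchases.offer_active]
            .groupby(['gender', 'age_group'])['amount']
            .mean())
print(baseline)
```

Comparing this baseline against average spend while an offer is active would indicate which groups are actually moved by an offer.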

Final Advice

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (e.g., 75 percent of 35-year-old women responded to offer A versus 40 percent of the same demographic to offer B, so send offer A).

Data Sets

The data is contained in three files:

  • portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json - demographic data for each customer
  • transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) - offer id
  • offer_type (string) - type of offer, i.e. BOGO, discount, informational
  • difficulty (int) - minimum required spend to complete an offer
  • reward (int) - reward given for completing an offer
  • duration (int) - time for the offer to be open, in days
  • channels (list of strings) - channels through which the offer is sent (web, email, mobile, social)

profile.json

  • age (int) - age of the customer
  • became_member_on (int) - date when customer created an app account
  • gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
  • id (str) - customer id
  • income (float) - customer's income

transcript.json

  • event (str) - record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) - customer id
  • time (int) - time in hours. The data begins at time t=0
  • value (dict of strings) - either an offer id or a transaction amount, depending on the record

Note: If you are using the workspace, you will need to go to the terminal and run the command conda update pandas before reading in the files. This is because the version of pandas in the workspace cannot read the transcript.json file correctly, but the newest version of pandas can. You can access the terminal from the orange icon in the top left of this notebook.

You can see how to access the terminal and how the install works in the two images below. First, you need to access the terminal:

Then, run the command above:

Finally, when you return to the notebook (use the Jupyter icon again), you should be able to run the cell below without any errors.

NOTE:

Running the entire notebook for the first time (without grid search) takes about 1.5 hours. Searching for hyperparameters in one of the implemented models takes about 40 minutes more, so with grid search enabled the complete notebook takes over 2 hours on a first run.

That's why a boolean was created to control the grid search; it is set to True by default (if it were False, the best model would be assigned manually from what was found in previous runs).

If you run the notebook a second time, or if you previously created the datasets (by running the "make_datasets.py" script, for example), the notebook running time will be reduced.

You can toggle the grid search on/off by changing the "grid_search" variable in the code accessible through the link below:

Toggle grid search (the "Top" link will take you back here)

In [256]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
from sklearn.cluster import KMeans
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [257]:
# read in the json files
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)

1. EDA

Top

1.A. Portfolio

Top

In [258]:
print(portfolio.shape)
portfolio.sort_values(by='offer_type')
(10, 6)
Out[258]:
channels difficulty duration id offer_type reward
0 [email, mobile, social] 10 7 ae264e3637204a6fb9bb56bc8210ddfd bogo 10
1 [web, email, mobile, social] 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 bogo 10
3 [web, email, mobile] 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 bogo 5
8 [web, email, mobile, social] 5 5 f19421c1d4aa40978ebb69ca19b0e20d bogo 5
4 [web, email] 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 discount 5
5 [web, email, mobile, social] 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 discount 3
6 [web, email, mobile, social] 10 10 fafdcd668e3743c1bb461111dcafc2a4 discount 2
9 [web, email, mobile] 10 7 2906b810c7d4411798c6938adc9daaa5 discount 2
2 [web, email, mobile] 0 4 3f207df678b143eea3cee63160fa8bed informational 0
7 [email, mobile, social] 0 3 5a8bc65990b245e5a138643cd4eb9837 informational 0

Let's substitute the "channels" column with one-hot encoded columns.
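One way such an encoding could be implemented (pp.channels_ohe is a project helper whose actual code may differ) is with explode plus str.get_dummies, shown here on a two-row toy portfolio:

```python
import pandas as pd

# Toy portfolio with a list-valued "channels" column, as in the raw data.
portfolio = pd.DataFrame({
    'id': ['o1', 'o2'],
    'channels': [['email', 'mobile', 'social'], ['web', 'email']],
})

dummies = (portfolio['channels']
           .explode()                 # one row per (offer, channel)
           .str.get_dummies()         # 0/1 column per channel
           .groupby(level=0).max()    # collapse back to one row per offer
           .add_prefix('channel_'))
portfolio = portfolio.drop(columns='channels').join(dummies)
print(portfolio)
```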

In [259]:
portfolio = pp.channels_ohe(portfolio)
portfolio
Out[259]:
difficulty duration id offer_type reward channel_mobile channel_web channel_email channel_social
0 10 7 ae264e3637204a6fb9bb56bc8210ddfd bogo 10 1 0 1 1
1 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 bogo 10 1 1 1 1
2 0 4 3f207df678b143eea3cee63160fa8bed informational 0 1 1 1 0
3 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 bogo 5 1 1 1 0
4 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 discount 5 0 1 1 0
5 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 discount 3 1 1 1 1
6 10 10 fafdcd668e3743c1bb461111dcafc2a4 discount 2 1 1 1 1
7 0 3 5a8bc65990b245e5a138643cd4eb9837 informational 0 1 0 1 1
8 5 5 f19421c1d4aa40978ebb69ca19b0e20d bogo 5 1 1 1 1
9 10 7 2906b810c7d4411798c6938adc9daaa5 discount 2 1 1 1 0

1.B. Profile

Top

In [260]:
print(profile.shape)
profile.head()
(17000, 5)
Out[260]:
age became_member_on gender id income
0 118 20170212 None 68be06ca386d4c31939f3a4f0e3dd783 NaN
1 55 20170715 F 0610b486422d4921ae7d2bf64640c50b 112000.0
2 118 20180712 None 38fe809add3b4fcf9315a9694bb96ff5 NaN
3 75 20170509 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0
4 118 20170804 None a03223e636434f42ac4c3df47e8bac43 NaN
In [261]:
profile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
age                 17000 non-null int64
became_member_on    17000 non-null int64
gender              14825 non-null object
id                  17000 non-null object
income              14825 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.1+ KB
In [262]:
profile.isnull().sum()
Out[262]:
age                    0
became_member_on       0
gender              2175
id                     0
income              2175
dtype: int64

The value age = 118 looks strange.

In [263]:
profile.age.value_counts()[:10]
Out[263]:
118    2175
58      408
53      372
51      363
54      359
59      359
57      353
52      351
55      350
56      342
Name: age, dtype: int64

It's almost certain that 118 was used as a placeholder for missing values in the age column. It also seems likely that customers with any missing data have all of their "profile" data missing. Let's test that.

In [264]:
profile.age = profile.age.replace(118, np.nan)
profile.isnull().sum(axis=1).unique()
Out[264]:
array([3, 0])

That means each customer has either 3 missing values or none, as hypothesized earlier.

Let's separate customers into those with missing data and those without. If a customer's whole profile is missing and they have no entries in the transcript dataframe, they should be ignored (we have no useful information about them). Are there any of those?
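utils.common_values (used below) is a project helper; the same check can be sketched with plain set arithmetic. Toy ids are shown here, whereas the real call compares profile.id against transcript.person:

```python
import pandas as pd

# Hypothetical id columns standing in for profile.id and transcript.person.
profile_ids = pd.Series(['a', 'b', 'c'])
transcript_persons = pd.Series(['a', 'b', 'c', 'a', 'b'])

s1, s2 = set(profile_ids), set(transcript_persons)
print('Intersection:', len(s1 & s2))   # ids present in both
print('Not in set 2:', len(s1 - s2))   # profiles with no transcript entries
print('Not in set 1:', len(s2 - s1))   # transcript persons with no profile
```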

In [265]:
profile['missing_demographics'] = profile.isnull().any(axis=1).astype(int)
profile.head()
Out[265]:
age became_member_on gender id income missing_demographics
0 NaN 20170212 None 68be06ca386d4c31939f3a4f0e3dd783 NaN 1
1 55.0 20170715 F 0610b486422d4921ae7d2bf64640c50b 112000.0 0
2 NaN 20180712 None 38fe809add3b4fcf9315a9694bb96ff5 NaN 1
3 75.0 20170509 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 0
4 NaN 20170804 None a03223e636434f42ac4c3df47e8bac43 NaN 1
In [266]:
utils.common_values(profile.id, transcript.person)
Intersection: 17000
Total set 1: 17000
Not in set 2: 0
Total set 2: 17000
Not in set 1: 0
Total: 17000

All the customers in the "profile" dataframe are also in the "transcript" dataframe. That means every customer provides at least some information, and none can be dropped.

Let's transform the dates

In [267]:
profile.became_member_on = pd.to_datetime(profile.became_member_on, 
                                          format='%Y%m%d')
profile.head()
Out[267]:
age became_member_on gender id income missing_demographics
0 NaN 2017-02-12 None 68be06ca386d4c31939f3a4f0e3dd783 NaN 1
1 55.0 2017-07-15 F 0610b486422d4921ae7d2bf64640c50b 112000.0 0
2 NaN 2018-07-12 None 38fe809add3b4fcf9315a9694bb96ff5 NaN 1
3 75.0 2017-05-09 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 0
4 NaN 2017-08-04 None a03223e636434f42ac4c3df47e8bac43 NaN 1

And encode the gender

In [268]:
gender_dict = {'F': 0, 'M': 1, 'O': 2, None: np.nan}
gender_dict_inverse = {0: 'F', 1: 'M', 2: 'O', np.nan: None}
profile.gender = profile.gender.replace(gender_dict)
profile.head()
Out[268]:
age became_member_on gender id income missing_demographics
0 NaN 2017-02-12 NaN 68be06ca386d4c31939f3a4f0e3dd783 NaN 1
1 55.0 2017-07-15 0.0 0610b486422d4921ae7d2bf64640c50b 112000.0 0
2 NaN 2018-07-12 NaN 38fe809add3b4fcf9315a9694bb96ff5 NaN 1
3 75.0 2017-05-09 0.0 78afa995795e4d85b5d9ceeca43f5fef 100000.0 0
4 NaN 2017-08-04 NaN a03223e636434f42ac4c3df47e8bac43 NaN 1

Let's find a format for the dates that the estimators can use.

In [269]:
# Create a feature that shows when a customer became member, 
# as the number of days since January 1st, 1970, to the signup date.
profile['member_since_epoch'] = (
    profile.became_member_on - dt.datetime(1970,1,1)).dt.days
profile.head()
Out[269]:
age became_member_on gender id income missing_demographics member_since_epoch
0 NaN 2017-02-12 NaN 68be06ca386d4c31939f3a4f0e3dd783 NaN 1 17209
1 55.0 2017-07-15 0.0 0610b486422d4921ae7d2bf64640c50b 112000.0 0 17362
2 NaN 2018-07-12 NaN 38fe809add3b4fcf9315a9694bb96ff5 NaN 1 17724
3 75.0 2017-05-09 0.0 78afa995795e4d85b5d9ceeca43f5fef 100000.0 0 17295
4 NaN 2017-08-04 NaN a03223e636434f42ac4c3df47e8bac43 NaN 1 17382

What do the features look like?

In [270]:
profile.income.hist(bins=30)
Out[270]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc00475bba8>
In [271]:
profile.gender.value_counts(dropna=False)
Out[271]:
 1.0    8484
 0.0    6129
NaN     2175
 2.0     212
Name: gender, dtype: int64
In [272]:
profile.age.hist(bins=30)
Out[272]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc003d30748>
In [273]:
profile.became_member_on.hist(bins=30)
Out[273]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc01cb70320>
In [274]:
sns.pairplot(profile[['age', 
                      'income', 
                      'gender', 
                      'member_since_epoch']].dropna())
Out[274]:
<seaborn.axisgrid.PairGrid at 0x7fc003c915c0>

There is a very strange age-income "stair" pattern... I assume that is due to the inner workings of the simulator.

Finding: The simulator seems to cap the income for younger customers, in discrete steps.

1.C. Transcript

Top

In [275]:
print(transcript.shape)
transcript.head()
(306534, 4)
Out[275]:
event person time value
0 offer received 78afa995795e4d85b5d9ceeca43f5fef 0 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}
1 offer received a03223e636434f42ac4c3df47e8bac43 0 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}
2 offer received e2127556f4f64592b11af22de27a7932 0 {'offer id': '2906b810c7d4411798c6938adc9daaa5'}
3 offer received 8ec6ce2a7e7949b1bf142def7d0e0586 0 {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
4 offer received 68617ca6246f4fbc85e91a2a49552598 0 {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}
In [276]:
transcript.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
event     306534 non-null object
person    306534 non-null object
time      306534 non-null int64
value     306534 non-null object
dtypes: int64(1), object(3)
memory usage: 9.4+ MB
In [277]:
transcript.time.unique()
Out[277]:
array([  0,   6,  12,  18,  24,  30,  36,  42,  48,  54,  60,  66,  72,
        78,  84,  90,  96, 102, 108, 114, 120, 126, 132, 138, 144, 150,
       156, 162, 168, 174, 180, 186, 192, 198, 204, 210, 216, 222, 228,
       234, 240, 246, 252, 258, 264, 270, 276, 282, 288, 294, 300, 306,
       312, 318, 324, 330, 336, 342, 348, 354, 360, 366, 372, 378, 384,
       390, 396, 402, 408, 414, 420, 426, 432, 438, 444, 450, 456, 462,
       468, 474, 480, 486, 492, 498, 504, 510, 516, 522, 528, 534, 540,
       546, 552, 558, 564, 570, 576, 582, 588, 594, 600, 606, 612, 618,
       624, 630, 636, 642, 648, 654, 660, 666, 672, 678, 684, 690, 696,
       702, 708, 714])
In [278]:
transcript.event.unique()
Out[278]:
array(['offer received', 'offer viewed', 'transaction', 'offer completed'],
      dtype=object)

Let's get the "value" data out of the dictionaries, and then check each type of event separately.
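pp.unwrap_transcript is a project helper; a possible implementation (the actual code may differ) expands the dictionaries with pd.json_normalize. One quirk assumed here: the raw data spells the key 'offer id' for received/viewed events but 'offer_id' for completions, so the two spellings are merged into one column:

```python
import pandas as pd

# Toy transcript with the mixed "value" dictionaries (hypothetical rows).
transcript = pd.DataFrame({
    'event': ['offer received', 'transaction', 'offer completed'],
    'person': ['p1', 'p1', 'p1'],
    'time': [0, 6, 12],
    'value': [{'offer id': 'o1'},
              {'amount': 9.5},
              {'offer_id': 'o1', 'reward': 2}],
})

values = pd.json_normalize(transcript['value'].tolist())
# Merge the two offer-id spellings into a single column.
values['offer_id'] = values['offer_id'].fillna(values['offer id'])
values = values.drop(columns='offer id')
transcript = transcript.drop(columns='value').join(values)
print(transcript)
```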

In [24]:
transcript = pp.unwrap_transcript(transcript)
In [25]:
transcript[transcript.event == 'offer received'].head()
Out[25]:
event person time amount offer_id reward
0 offer received 78afa995795e4d85b5d9ceeca43f5fef 0 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN
1 offer received a03223e636434f42ac4c3df47e8bac43 0 NaN 0b1e1539f2cc45b7b9fa7c272da2e1d7 NaN
2 offer received e2127556f4f64592b11af22de27a7932 0 NaN 2906b810c7d4411798c6938adc9daaa5 NaN
3 offer received 8ec6ce2a7e7949b1bf142def7d0e0586 0 NaN fafdcd668e3743c1bb461111dcafc2a4 NaN
4 offer received 68617ca6246f4fbc85e91a2a49552598 0 NaN 4d5c57ea9a6940dd891ad53e9dbe8da0 NaN
In [26]:
transcript[transcript.event == 'offer viewed'].head()
Out[26]:
event person time amount offer_id reward
12650 offer viewed 389bc3fa690240e798340f5a15918d5c 0 NaN f19421c1d4aa40978ebb69ca19b0e20d NaN
12651 offer viewed d1ede868e29245ea91818a903fec04c6 0 NaN 5a8bc65990b245e5a138643cd4eb9837 NaN
12652 offer viewed 102e9454054946fda62242d2e176fdce 0 NaN 4d5c57ea9a6940dd891ad53e9dbe8da0 NaN
12653 offer viewed 02c083884c7d45b39cc68e1314fec56c 0 NaN ae264e3637204a6fb9bb56bc8210ddfd NaN
12655 offer viewed be8a5d1981a2458d90b255ddc7e0d174 0 NaN 5a8bc65990b245e5a138643cd4eb9837 NaN
In [27]:
transcript[transcript.event == 'transaction'].head()
Out[27]:
event person time amount offer_id reward
12654 transaction 02c083884c7d45b39cc68e1314fec56c 0 0.83 NaN NaN
12657 transaction 9fa9ae8f57894cc9a3b8a9bbe0fc1b2f 0 34.56 NaN NaN
12659 transaction 54890f68699049c2a04d415abc25e717 0 13.23 NaN NaN
12670 transaction b2f1cd155b864803ad8334cdf13c4bd2 0 19.51 NaN NaN
12671 transaction fe97aa22dd3e48c8b143116a8403dd52 0 18.97 NaN NaN
In [28]:
transcript[transcript.event == 'offer completed'].head()
Out[28]:
event person time amount offer_id reward
12658 offer completed 9fa9ae8f57894cc9a3b8a9bbe0fc1b2f 0 NaN 2906b810c7d4411798c6938adc9daaa5 2.0
12672 offer completed fe97aa22dd3e48c8b143116a8403dd52 0 NaN fafdcd668e3743c1bb461111dcafc2a4 2.0
12679 offer completed 629fc02d56414d91bca360decdfa9288 0 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 5.0
12692 offer completed 676506bad68e4161b9bbaffeb039626b 0 NaN ae264e3637204a6fb9bb56bc8210ddfd 10.0
12697 offer completed 8f7dd3b2afe14c078eb4f6e6fe4ba97d 0 NaN 4d5c57ea9a6940dd891ad53e9dbe8da0 10.0

The "event" column could be label-encoded but, other than that, there don't seem to be many simple preprocessing actions to take. There is still a lot of data wrangling to do before having a well-posed problem, though.
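A pandas-only sketch of that label encoding (sklearn's LabelEncoder would assign the same alphabetical codes):

```python
import pandas as pd

# Toy sample of the "event" column.
events = pd.Series(['offer received', 'offer viewed', 'transaction',
                    'offer completed', 'transaction'])

# Categories are sorted alphabetically, so:
# 'offer completed' -> 0, 'offer received' -> 1,
# 'offer viewed' -> 2, 'transaction' -> 3
codes = events.astype('category').cat.codes
print(codes.tolist())
```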

Let's make a dataset that doesn't take into account the particular person or offer, but rather their features.

In [29]:
static_df = pp.join_data(transcript, profile, portfolio)
static_df.head()
Out[29]:
event time amount reward age became_member_on gender income missing_demographics member_since_epoch difficulty duration offer_type reward_t channel_mobile channel_web channel_email channel_social
0 offer received 0 NaN NaN 75.0 2017-05-09 0.0 100000.0 0 17295 5.0 7.0 bogo 5.0 1.0 1.0 1.0 0.0
1 offer received 0 NaN NaN NaN 2017-08-04 NaN NaN 1 17382 20.0 10.0 discount 5.0 0.0 1.0 1.0 0.0
2 offer received 0 NaN NaN 68.0 2018-04-26 1.0 70000.0 0 17647 10.0 7.0 discount 2.0 1.0 1.0 1.0 0.0
3 offer received 0 NaN NaN NaN 2017-09-25 NaN NaN 1 17434 10.0 10.0 discount 2.0 1.0 1.0 1.0 1.0
4 offer received 0 NaN NaN NaN 2017-10-02 NaN NaN 1 17441 10.0 5.0 bogo 10.0 1.0 1.0 1.0 1.0
In [30]:
static_df.duration.unique()
Out[30]:
array([ 7., 10.,  5.,  4.,  3., nan])

Let's check how often offers are sent to customers, and if the timings are coordinated.

In [31]:
merged_df = pp.join_data(transcript, profile, portfolio, static=False)
merged_df.head()
Out[31]:
event person time amount offer_id reward age became_member_on gender income missing_demographics member_since_epoch difficulty duration offer_type reward_t channel_mobile channel_web channel_email channel_social
0 offer received 78afa995795e4d85b5d9ceeca43f5fef 0 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN 75.0 2017-05-09 0.0 100000.0 0 17295 5.0 7.0 bogo 5.0 1.0 1.0 1.0 0.0
1 offer received a03223e636434f42ac4c3df47e8bac43 0 NaN 0b1e1539f2cc45b7b9fa7c272da2e1d7 NaN NaN 2017-08-04 NaN NaN 1 17382 20.0 10.0 discount 5.0 0.0 1.0 1.0 0.0
2 offer received e2127556f4f64592b11af22de27a7932 0 NaN 2906b810c7d4411798c6938adc9daaa5 NaN 68.0 2018-04-26 1.0 70000.0 0 17647 10.0 7.0 discount 2.0 1.0 1.0 1.0 0.0
3 offer received 8ec6ce2a7e7949b1bf142def7d0e0586 0 NaN fafdcd668e3743c1bb461111dcafc2a4 NaN NaN 2017-09-25 NaN NaN 1 17434 10.0 10.0 discount 2.0 1.0 1.0 1.0 1.0
4 offer received 68617ca6246f4fbc85e91a2a49552598 0 NaN 4d5c57ea9a6940dd891ad53e9dbe8da0 NaN NaN 2017-10-02 NaN NaN 1 17441 10.0 5.0 bogo 10.0 1.0 1.0 1.0 1.0
In [32]:
def get_differences(user_events):
    return pd.DataFrame((user_events.time - user_events.time.shift(1)).values)
In [33]:
# "delays" shows the time differences between consecutive offers for each
# user.
sent = merged_df[merged_df.event == 'offer received']
delays = sent.groupby('person').apply(get_differences).rename(
    columns={0: 'diff'})
delays = delays.unstack()
delays.head()
Out[33]:
diff
0 1 2 3 4 5
person
0009655768c64bdeb2e877511632db8f NaN 168.0 72.0 96.0 72.0 NaN
00116118485d4dfda04fdbaba9a87b5c NaN 408.0 NaN NaN NaN NaN
0011e0d4e6b944f998e987f904e8c1e5 NaN 168.0 168.0 72.0 96.0 NaN
0020c2b971eb4e9188eac86d93036a77 NaN 168.0 168.0 72.0 96.0 NaN
0020ccbbb6d84e358d3414a3ff76cffd NaN 168.0 72.0 96.0 NaN NaN
In [34]:
diffs = delays.values.flatten()
diffs = diffs[~np.isnan(diffs)]
diffs = pd.Series(diffs)
diffs.hist(bins=50)
plt.title('Time difference between two consecutive offers, for each user')
Out[34]:
Text(0.5, 1.0, 'Time difference between two consecutive offers, for each user')
In [35]:
diffs.value_counts()
Out[35]:
168.0    23754
72.0     19109
96.0      9557
336.0     2942
240.0     2908
408.0      810
504.0      156
576.0       47
dtype: int64
In [36]:
d_vals = diffs.sort_values().unique()
d_vals
Out[36]:
array([ 72.,  96., 168., 240., 336., 408., 504., 576.])
In [37]:
d_vals[1:] - d_vals[:-1]
Out[37]:
array([24., 72., 72., 96., 72., 96., 72.])

The offers are clearly being sent in a coordinated fashion, and at very specific times. The delays are all multiples of 24 hours. Let's investigate further.

In [38]:
# Times contains the time at which each offer was sent
times = sent.groupby('person').apply(lambda x: pd.DataFrame(x.time.values)).rename(
    columns={0: 'times'}).unstack()
times.head()
Out[38]:
times
0 1 2 3 4 5
person
0009655768c64bdeb2e877511632db8f 168.0 336.0 408.0 504.0 576.0 NaN
00116118485d4dfda04fdbaba9a87b5c 168.0 576.0 NaN NaN NaN NaN
0011e0d4e6b944f998e987f904e8c1e5 0.0 168.0 336.0 408.0 504.0 NaN
0020c2b971eb4e9188eac86d93036a77 0.0 168.0 336.0 408.0 504.0 NaN
0020ccbbb6d84e358d3414a3ff76cffd 168.0 336.0 408.0 504.0 NaN NaN
In [39]:
send_times = times.values.flatten()
send_times = pd.Series(send_times[~np.isnan(send_times)])
send_times.hist(bins=50)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc01c3d47f0>
In [40]:
send_times.value_counts()
Out[40]:
408.0    12778
576.0    12765
336.0    12711
504.0    12704
168.0    12669
0.0      12650
dtype: int64

Let's check if they are all multiples of 24.

In [41]:
time_values = send_times.sort_values().unique()
time_values / 24
Out[41]:
array([ 0.,  7., 14., 17., 21., 24.])

Indeed, all the sending times are multiples of 24 hours (one day).

Let's look at the transactions in time, and the effect of the offers.

In [42]:
transactions = merged_df[merged_df.event == 'transaction']
transactions.time.hist(bins=50)
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc01c3b3550>
In [43]:
transaction_t_values = transactions.groupby('time').time.count()
transaction_t_values.head()
Out[43]:
time
0     633
6     797
12    850
18    879
24    922
Name: time, dtype: int64

Transaction times go in steps of 6 hours, which is the minimum "tick" of the simulator (or at least of the dataset provided). That means that transactions, unlike offers, can occur at any time of the day.

Let's look at the effects of the offers in transactions.

In [44]:
transaction_t_values.plot()
for time_value in time_values:
    plt.vlines(time_value, 
               transaction_t_values.min(),
               transaction_t_values.max(), 'r', label='offers sending')
plt.title('Amount of transactions and offers sending times')
Out[44]:
Text(0.5, 1.0, 'Amount of transactions and offers sending times')

The offers seem to be having a positive effect overall. That effect also seems to last more than 168 hours (7 days) after the start of an offer; note that some offers have a duration of 10 days.

That means the effects of one offer and the next could overlap. That should be checked.

Initially, at least, the effect of an offer will be assessed for one particular client, although it is clear that some offers can also affect the behavior of other clients. In particular, BOGO offers seem well suited to acquiring new clients, for example when an existing customer treats another person to a "free" coffee. That won't be considered in this project, because it would be very difficult with the available data, but it could be an interesting thing to research in the real world.

Is there any overlapping between the effects of offers?

Let's study one case. If there is overlapping in that case, we will know that the simulator produces offers that may overlap; if not, we can look at other customers or devise a general procedure to detect overlapping in the full dataset.

In [45]:
person = merged_df[merged_df.person == merged_df.person[0]]
offers = person[person.event.isin(['offer received', 'offer completed'])]
offers = offers[['event', 'time', 'duration', 'offer_id']]
reception = offers[offers.event == 'offer received'].copy()
reception['expected_finish'] = reception.time + 24 * reception.duration
reception
Out[45]:
event time duration offer_id expected_finish
0 offer received 0 7.0 9b98b8c7a33c4b65b9aebfe6a799e6d9 168.0
53176 offer received 168 3.0 5a8bc65990b245e5a138643cd4eb9837 240.0
150598 offer received 408 7.0 ae264e3637204a6fb9bb56bc8210ddfd 576.0
201572 offer received 504 5.0 f19421c1d4aa40978ebb69ca19b0e20d 624.0
In [46]:
# This plots the times while each offer is "active".
for idx, row in reception.iterrows():
    x = np.arange(merged_df.time.max()).astype(int)  # Total time
    x_on = np.arange(row.time, row.expected_finish).astype(int)  # Time when the offer is "active"
    y = np.zeros(merged_df.time.max())
    y[x_on] = 1
    plt.plot(x, y)

spending = np.zeros(merged_df.time.max())
spending[person.time] = person.amount
plt.plot(x, spending / spending.sum())
Out[46]:
[<matplotlib.lines.Line2D at 0x7fc01c3404e0>]

There is clear overlapping in that case. That suggests that overlapping offer effects are not a rare phenomenon, and they will obviously introduce some distortion.

2. Preprocessing

Top

Based on what was seen in the EDA, several functions were implemented to preprocess the raw data. The initial functions preprocess each dataframe individually, and further functions then create a "static" dataset. It is "static" in the sense that it is built under the assumption that customer behavior is stationary, so that an estimator can ignore the "absolute" time feature.

The preprocessing will be shown below, using the implemented functions. After that, the full process can be run by calling just two functions:

  • basic_preprocessing
  • generate_static_dataset
In [47]:
%reset -f
In [48]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import datetime as dt

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [49]:
# read in the json files
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)

2.1 Basic Preprocessing

Top

Portfolio: One hot encode the offers' channels

In [50]:
portfolio = pp.process_portfolio(portfolio)
portfolio
Out[50]:
difficulty duration id offer_type reward channel_mobile channel_web channel_email channel_social
0 10 7 ae264e3637204a6fb9bb56bc8210ddfd bogo 10 1 0 1 1
1 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 bogo 10 1 1 1 1
2 0 4 3f207df678b143eea3cee63160fa8bed informational 0 1 1 1 0
3 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 bogo 5 1 1 1 0
4 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 discount 5 0 1 1 0
5 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 discount 3 1 1 1 1
6 10 10 fafdcd668e3743c1bb461111dcafc2a4 discount 2 1 1 1 1
7 0 3 5a8bc65990b245e5a138643cd4eb9837 informational 0 1 0 1 1
8 5 5 f19421c1d4aa40978ebb69ca19b0e20d bogo 5 1 1 1 1
9 10 7 2906b810c7d4411798c6938adc9daaa5 discount 2 1 1 1 0

Profile: Substitute 118 with NaN, format dates, create date feature, add "missing demographics" feature

In [51]:
profile = pp.process_profile(profile)
print(profile.shape)
profile.head()
(17000, 7)
Out[51]:
age became_member_on gender id income missing_demographics member_epoch_days
0 NaN 2017-02-12 None 68be06ca386d4c31939f3a4f0e3dd783 NaN 1 17209
1 55.0 2017-07-15 F 0610b486422d4921ae7d2bf64640c50b 112000.0 0 17362
2 NaN 2018-07-12 None 38fe809add3b4fcf9315a9694bb96ff5 NaN 1 17724
3 75.0 2017-05-09 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0 0 17295
4 NaN 2017-08-04 None a03223e636434f42ac4c3df47e8bac43 NaN 1 17382

Transcript: Unwrap the "value" dictionaries

In [52]:
transcript = pp.process_transcript(transcript)
print(transcript.shape)
transcript.head()
(306534, 6)
Out[52]:
event person time amount offer_id reward
0 offer received 78afa995795e4d85b5d9ceeca43f5fef 0 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN
1 offer received a03223e636434f42ac4c3df47e8bac43 0 NaN 0b1e1539f2cc45b7b9fa7c272da2e1d7 NaN
2 offer received e2127556f4f64592b11af22de27a7932 0 NaN 2906b810c7d4411798c6938adc9daaa5 NaN
3 offer received 8ec6ce2a7e7949b1bf142def7d0e0586 0 NaN fafdcd668e3743c1bb461111dcafc2a4 NaN
4 offer received 68617ca6246f4fbc85e91a2a49552598 0 NaN 4d5c57ea9a6940dd891ad53e9dbe8da0 NaN

Do it all at once and join the dataframes.

In [53]:
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)

data, portfolio = pp.basic_preprocessing(portfolio, profile, transcript)

print(data.shape)
data.head()
(306534, 20)
Out[53]:
event person time amount offer_id reward age became_member_on gender income missing_demographics member_epoch_days difficulty duration offer_type reward_t channel_mobile channel_web channel_email channel_social
0 offer received 78afa995795e4d85b5d9ceeca43f5fef 0 NaN 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN 75.0 2017-05-09 F 100000.0 0 17295 5.0 7.0 bogo 5.0 1.0 1.0 1.0 0.0
1 offer received a03223e636434f42ac4c3df47e8bac43 0 NaN 0b1e1539f2cc45b7b9fa7c272da2e1d7 NaN NaN 2017-08-04 None NaN 1 17382 20.0 10.0 discount 5.0 0.0 1.0 1.0 0.0
2 offer received e2127556f4f64592b11af22de27a7932 0 NaN 2906b810c7d4411798c6938adc9daaa5 NaN 68.0 2018-04-26 M 70000.0 0 17647 10.0 7.0 discount 2.0 1.0 1.0 1.0 0.0
3 offer received 8ec6ce2a7e7949b1bf142def7d0e0586 0 NaN fafdcd668e3743c1bb461111dcafc2a4 NaN NaN 2017-09-25 None NaN 1 17434 10.0 10.0 discount 2.0 1.0 1.0 1.0 1.0
4 offer received 68617ca6246f4fbc85e91a2a49552598 0 NaN 4d5c57ea9a6940dd891ad53e9dbe8da0 NaN NaN 2017-10-02 None NaN 1 17441 10.0 5.0 bogo 10.0 1.0 1.0 1.0 1.0

2.2 Creation of the "static" dataset

Top

The process is first shown for a single customer; then the full dataset is computed.

In [54]:
person = data[data.person == data.person.iloc[0]]

Split the dataset for convenience, according to the event type.

In [55]:
received, viewed, completed, transactions = pp.split_transcript(person)

Add the "completed" column

The next step is to add a column that tells whether an offer was completed. That is information "from the future", so it will typically be used as a "target" rather than as a "feature". The "fill_completion" function takes care of determining whether an offer will be completed. Its docstring can be seen below.

In [56]:
print('fill_completion\n' + '-'*100)
print(pp.fill_completion.__doc__)
fill_completion
----------------------------------------------------------------------------------------------------

    Looks in the records of one person and checks which offers were completed.
    A 'completed' column is set to 1 when the offer was completed. The finish
    time is also added.
    Args:
        received(pd.DataFrame): As returned from split_transcript
        completed(pd.DataFrame): As returned from split_transcript

    Returns:
        pd.DataFrame: The received dataframe with some new columns.
    
In [57]:
data = pp.fill_completion(received, completed)

Add the "viewed" and "success" columns

Now, let's fill a column that checks if an offer was viewed in its "active" period (called "viewed"), and a "success" column that marks if the offer was completed after being viewed, within the "active" period.

The docstring for the function can be seen below.

Note: All the "informational" offers will be marked with "success = 0", because they are never "completed".
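The core of this check can be sketched for a single offer as below (times in hours, durations in days, as in the dataset; the helper name and the exact window conventions are assumptions, not the actual `pp.fill_viewed`):

```python
import pandas as pd

def offer_outcome_sketch(offer: dict,
                         views: pd.DataFrame,
                         completions: pd.DataFrame) -> tuple:
    """Illustrative 'viewed' / 'success' check for one received offer.

    offer: {'offer_id', 'time' (hours), 'duration' (days)}.
    views / completions: this person's 'offer viewed' / 'offer completed'
    events, with 'offer_id' and 'time' (hours) columns.
    """
    expiry = offer['time'] + offer['duration'] * 24  # days -> hours
    # A view only counts inside the offer's active period
    v = views[(views['offer_id'] == offer['offer_id'])
              & views['time'].between(offer['time'], expiry)]
    viewed = int(len(v) > 0)
    success = 0
    if viewed:
        # Success: completed after the first view, before expiry
        first_view = v['time'].min()
        c = completions[(completions['offer_id'] == offer['offer_id'])
                        & completions['time'].between(first_view, expiry)]
        success = int(len(c) > 0)
    return viewed, success
```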

In [58]:
print('fill_viewed\n' + '-'*100)
print(pp.fill_viewed.__doc__)
fill_viewed
----------------------------------------------------------------------------------------------------

    Checks if the offer was viewed in the active period of the offers.
    Also fills a column called 'success' that tracks whether an offer
    completion happened after a view.
    Args:
        data(pd.DataFrame): As returned from fill_completed
        viewed(pd.DataFrame): As returned from split_transcript

    Returns:
        pd.DataFrame: The received dataframe with some new columns.
    
In [59]:
data = pp.fill_viewed(data, viewed)

The next step was to record each customer's spending during the "offer's duration" (from reception until expiration), and the same amount from reception until "completion or expiration".

The corresponding "profit" columns were also created. Calculating realistic profits would require the "cost of production", which is not available. Moreover, some offers are clearly not intended to generate profit during their duration or until completion, but over a longer period: with BOGO offers, for instance, the objective is not the completion itself but its posterior or secondary effects. Given the lack of further information, the "profits" were calculated as the customer's "spending" minus the "reward" paid if the offer was completed.

The "actual reward" column shows the paid reward (zero if the offer was not completed).
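Under those assumptions, the computation for one offer window reduces to something like this (argument names are illustrative; the real logic is in `pp.fill_profits`):

```python
import pandas as pd

def profit_sketch(transactions: pd.DataFrame, start: float, end: float,
                  reward: float, completed: bool) -> tuple:
    """Illustrative spending / profit computation for one offer window.

    transactions: this person's purchases, with 'time' and 'amount' columns.
    [start, end]: the window in hours (reception to expiration or completion).
    """
    # Sum the purchases that fall inside the window
    spent = transactions.loc[
        transactions['time'].between(start, end), 'amount'].sum()
    # The reward is only a cost to the company if the offer was completed
    actual_reward = reward if completed else 0.0
    return spent - actual_reward, spent, actual_reward
```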

In [60]:
print('fill_profits\n' + '-'*100)
print(pp.fill_profits.__doc__)
fill_profits
----------------------------------------------------------------------------------------------------

    Fills "spending" and "profits" related columns.
    The "spending" columns track the transactions of the client in the "active"
    period of an offer, and adds them. They are also summed in the period between the
    offer reception and the offer completion (if it is completed).
    The profits columns consider the paid rewards as a cost to the company and
    subtract them.
    The paid reward is also recorded in the column "actual reward" (it is zero if the
    offer was not completed).
    Args:
        data(pd.DataFrame): As returned from fill_completed
        transactions(pd.DataFrame): As returned from split_transcript

    Returns:
        pd.DataFrame: The received dataframe with some new columns.
    
In [61]:
data = pp.fill_profits(data, transactions)
data = data.drop(['event', 'reward', 'amount'], axis=1)

data
Out[61]:
person time offer_id age became_member_on gender income missing_demographics member_epoch_days difficulty ... expected_finish finish success view_time viewed actual_reward profit_in_duration profit_until_complete spent_in_duration spent_until_complete
0 78afa995795e4d85b5d9ceeca43f5fef 0 9b98b8c7a33c4b65b9aebfe6a799e6d9 75.0 2017-05-09 F 100000.0 0 17295 5.0 ... 168.0 132.0 1 6.0 1 5.0 32.67 14.89 37.67 19.89
1 78afa995795e4d85b5d9ceeca43f5fef 168 5a8bc65990b245e5a138643cd4eb9837 75.0 2017-05-09 F 100000.0 0 17295 0.0 ... 240.0 240.0 0 216.0 1 0.0 49.39 49.39 49.39 49.39
2 78afa995795e4d85b5d9ceeca43f5fef 408 ae264e3637204a6fb9bb56bc8210ddfd 75.0 2017-05-09 F 100000.0 0 17295 10.0 ... 576.0 510.0 1 408.0 1 10.0 38.28 11.72 48.28 21.72
3 78afa995795e4d85b5d9ceeca43f5fef 504 f19421c1d4aa40978ebb69ca19b0e20d 75.0 2017-05-09 F 100000.0 0 17295 5.0 ... 624.0 510.0 0 NaN 0 5.0 43.28 16.72 48.28 21.72

4 rows × 28 columns

And that is how the "static" dataset looks for one customer alone. The entire process, for all customers, is summarized in the two functions below. It may take several minutes to run.

Complete preprocessing summarized in two functions

In [62]:
# Read the data
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)

# Basic Preprocessing
print('Basic preprocessing')
%time data, portfolio = pp.basic_preprocessing(portfolio, profile, transcript)

# Create the static dataset (only if it doesn't exist already: it takes long)
print('Creating the static dataset (this may take several minutes)')
if os.path.exists(os.path.join(DATA_INTERIM, 'static_data.pkl')):
    static_data = pd.read_pickle(os.path.join(DATA_INTERIM, 'static_data.pkl'))
else:
    %time static_data = pp.generate_static_dataset(data)

# Save the static dataset
print('Saving')
static_dataset_path = os.path.join(DATA_INTERIM, 'static_data.pkl')
static_data.to_pickle(static_dataset_path)
Basic preprocessing
CPU times: user 2.99 s, sys: 63.7 ms, total: 3.05 s
Wall time: 2.04 s
Creating the static dataset (this may take several minutes)
Saving

3 The "Offer Success" problem statement

Top

In this part, the functions necessary to solve the "offer success" prediction problem are created: a function to build the train-test datasets, an encoder class, a function to show the feature importances of XGBoost models, and some validation and test functions. Several validation schemes were considered: time-split, random K-fold, random train-val split, and random train-val split by customer.

In [63]:
%reset -f
In [64]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import datetime as dt
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Dataset creation

The "static" dataset is adapted for the "offer success" problem. That means removing all the information "from the future" from the inputs (the "success" feature, with some modifications, will be the target). Also, the customer and offer ids may be removed, and a time-based split of the data performed. The function that takes care of this is called "get_success_data"; its docstring can be seen below.
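The essence of the split can be sketched as a time cut-off on the offer-reception time (the cut-off value, the column lists, and the helper name are illustrative, not the actual `sd.get_success_data`):

```python
import pandas as pd

def time_split_sketch(static_data: pd.DataFrame, time_limit: int = 504):
    """Illustrative train/test split on the offer-reception time (hours)."""
    # Columns that carry information "from the future" must not be inputs
    future_cols = ['success', 'completed', 'viewed', 'view_time']
    train = static_data[static_data['time'] < time_limit]
    test = static_data[static_data['time'] >= time_limit]
    X_train = train.drop(columns=future_cols + ['time'])
    X_test = test.drop(columns=future_cols + ['time'])
    return X_train, X_test, train['success'], test['success']
```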

In [65]:
print(sd.get_success_data.__doc__)
    Generates the dataset to predict whether an offer was successful.
    An offer is considered successful if it is viewed and then completed. In
    the case of informational offers a visualization alone may be considered a
    success or not.
    Args:
        basic_dataset_path(str): The path to the pickle containing the basic
            dataset
        time_limit(int): The limit to split the train and test sets.
        informational_success(boolean): Whether a visualization of an
            informational offer should be considered as a success.
        drop_time(boolean): Whether to drop the absolute time dependent
            features.
        anon(boolean): Whether to drop unique identifiers to customers and
            offers.


    Returns:
        X_train(pd.DataFrame): The training dataset.
        X_test(pd.DataFrame): The test dataset.
        y_train(pd.Series): The training target.
        y_test(pd.Series): The test target.
        BasicEncoder: An encoder to use in an ML pipeline.
    
In [66]:
X_train, X_test, y_train, y_test, encoder = sd.get_success_data()
In [67]:
X_train.head()
Out[67]:
age gender income missing_demographics member_epoch_days difficulty duration offer_type reward_t channel_mobile channel_email channel_web channel_social
0 33.0 M 72000.0 0 17277 0.0 3.0 informational 0.0 1.0 1.0 0.0 1.0
1 33.0 M 72000.0 0 17277 0.0 4.0 informational 0.0 1.0 1.0 1.0 0.0
2 33.0 M 72000.0 0 17277 5.0 5.0 bogo 5.0 1.0 1.0 1.0 1.0
5 NaN None NaN 1 17646 5.0 5.0 bogo 5.0 1.0 1.0 1.0 1.0
7 40.0 O 57000.0 0 17540 0.0 4.0 informational 0.0 1.0 1.0 1.0 0.0
In [68]:
y_train.head()
Out[68]:
0    1
1    1
2    0
5    0
7    1
Name: success, dtype: int64

Validation

The following methods were considered for validation:

  • Time-split
  • Random K-fold validation (test is split by time)
  • Random train-val split (test is split by time)
  • Random train-val split by customers (test is split by time)

The time-split method was selected as the main validation method. The random train-val split by customer was chosen as a sanity check (to test whether its results are similar to those of the time-split).
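The customer-level split can be reproduced with scikit-learn's `GroupShuffleSplit`, which guarantees that no customer appears on both sides of the split (the toy data below is illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins for the feature matrix and the customer ids
X = np.arange(12).reshape(6, 2)
persons = np.array(['a', 'a', 'b', 'b', 'c', 'c'])

# One train/validation split, grouped by customer
gss = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, val_idx = next(gss.split(X, groups=persons))

# No customer leaks across the split boundary
assert set(persons[train_idx]).isdisjoint(persons[val_idx])
```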

Time split validation.

In [69]:
# Create a basic model

model = Pipeline([
    ('encoder', encoder),
    ('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1, 
                                feature_names=X_train.columns))
])

# Evaluate it with a time-split validation
evos.time_split_validation(model)
Training time: 21.39863634109497 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[17920  2212]
 [ 2165 15733]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.89      0.89     20132
           1       0.88      0.88      0.88     17898

   micro avg       0.88      0.88      0.88     38030
   macro avg       0.88      0.88      0.88     38030
weighted avg       0.88      0.88      0.88     38030

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[4943 1800]
 [1963 4072]]
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.73      0.72      6743
           1       0.69      0.67      0.68      6035

   micro avg       0.71      0.71      0.71     12778
   macro avg       0.70      0.70      0.70     12778
weighted avg       0.71      0.71      0.71     12778

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.6839674141261443 |
---------------------------------------------------
Out[69]:
(Pipeline(memory=None,
      steps=[('encoder', BasicEncoder()), ('estimator', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
        colsample_bytree=1,
        feature_names=Index(['age', 'gender', 'income', 'missing_demographics', 'member_epoch_days',
        'difficulty', 'duration', 'offer_type', 'reward...
        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
        silent=True, subsample=1))]),
 array([1, 1, 0, ..., 1, 0, 1]),
 array([1, 1, 1, ..., 0, 1, 1]))

Random KFold validation

In [70]:
# Create a basic model

model = Pipeline([
    ('encoder', encoder),
    ('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1, 
                                feature_names=X_train.columns))
])

# Evaluate it with a random k-fold validation
evos.random_kfold_validation(model)
Fold - 1
Training time: 19.798202514648438 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[16131  1785]
 [ 1815 14140]]
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.90      0.90     17916
           1       0.89      0.89      0.89     15955

   micro avg       0.89      0.89      0.89     33871
   macro avg       0.89      0.89      0.89     33871
weighted avg       0.89      0.89      0.89     33871

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[6495 2464]
 [2494 5484]]
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.72      0.72      8959
           1       0.69      0.69      0.69      7978

   micro avg       0.71      0.71      0.71     16937
   macro avg       0.71      0.71      0.71     16937
weighted avg       0.71      0.71      0.71     16937

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.688685168906191 |
---------------------------------------------------
Fold - 2
Training time: 20.36731767654419 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[16028  1889]
 [ 1754 14201]]
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.89      0.90     17917
           1       0.88      0.89      0.89     15955

   micro avg       0.89      0.89      0.89     33872
   macro avg       0.89      0.89      0.89     33872
weighted avg       0.89      0.89      0.89     33872

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[6455 2503]
 [2528 5450]]
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.72      0.72      8958
           1       0.69      0.68      0.68      7978

   micro avg       0.70      0.70      0.70     16936
   macro avg       0.70      0.70      0.70     16936
weighted avg       0.70      0.70      0.70     16936

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.6842006151528467 |
---------------------------------------------------
Fold - 3
Training time: 19.86534857749939 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[16152  1765]
 [ 1763 14193]]
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.90      0.90     17917
           1       0.89      0.89      0.89     15956

   micro avg       0.90      0.90      0.90     33873
   macro avg       0.90      0.90      0.90     33873
weighted avg       0.90      0.90      0.90     33873

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[6533 2425]
 [2563 5414]]
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.73      0.72      8958
           1       0.69      0.68      0.68      7977

   micro avg       0.71      0.71      0.71     16935
   macro avg       0.70      0.70      0.70     16935
weighted avg       0.71      0.71      0.71     16935

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.6846231664137582 |
---------------------------------------------------
Training F1-score: 0.8876151865514176 +- 0.0013360302462744908

Validation F1-score: 0.6858363168242653 +- 0.004043630721861681

Random train-val split validation

In [71]:
# Create a basic model

model = Pipeline([
    ('encoder', encoder),
    ('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1, 
                                feature_names=X_train.columns))
])

# Evaluate it with a random train-val split validation
evos.random_1fold_validation(model)
Training time: 22.20442819595337 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[16849  1940]
 [ 1910 14866]]
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.90      0.90     18789
           1       0.88      0.89      0.89     16776

   micro avg       0.89      0.89      0.89     35565
   macro avg       0.89      0.89      0.89     35565
weighted avg       0.89      0.89      0.89     35565

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[5860 2226]
 [2230 4927]]
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.72      0.72      8086
           1       0.69      0.69      0.69      7157

   micro avg       0.71      0.71      0.71     15243
   macro avg       0.71      0.71      0.71     15243
weighted avg       0.71      0.71      0.71     15243

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.6886093640810622 |
---------------------------------------------------
Training F1-score: 0.8853552498362218

Validation F1-score: 0.6886093640810622

Random train-val split by customer

In [72]:
# Create a basic model

model = Pipeline([
    ('encoder', encoder),
    ('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1, 
                                feature_names=X_train.columns))
])

# Evaluate it with a random train-val split validation by customer
evos.random_1fold_cust_validation(model)
Training time: 23.49664330482483 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[16742  2130]
 [ 2055 14653]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.89      0.89     18872
           1       0.87      0.88      0.88     16708

   micro avg       0.88      0.88      0.88     35580
   macro avg       0.88      0.88      0.88     35580
weighted avg       0.88      0.88      0.88     35580

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[5716 2287]
 [2393 4832]]
Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.71      0.71      8003
           1       0.68      0.67      0.67      7225

   micro avg       0.69      0.69      0.69     15228
   macro avg       0.69      0.69      0.69     15228
weighted avg       0.69      0.69      0.69     15228

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.6737311767986615 |
---------------------------------------------------
Training F1-score: 0.8750410558060374

Validation F1-score: 0.6737311767986615

Test results

In [73]:
# Create a basic model

model = Pipeline([
    ('encoder', encoder),
    ('estimator', XGBClassifier(max_depth=7, n_estimators=1000, n_jobs=-1, 
                                feature_names=X_train.columns))
])

# Evaluate the model in the test set (time-splitted)
evos.offer_success_test(model)
Training time: 50.782225131988525 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[23433  3442]
 [ 3490 20443]]
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.87      0.87     26875
           1       0.86      0.85      0.86     23933

   micro avg       0.86      0.86      0.86     50808
   macro avg       0.86      0.86      0.86     50808
weighted avg       0.86      0.86      0.86     50808

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[10080  3908]
 [ 3432  8049]]
Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.72      0.73     13988
           1       0.67      0.70      0.69     11481

   micro avg       0.71      0.71      0.71     25469
   macro avg       0.71      0.71      0.71     25469
weighted avg       0.71      0.71      0.71     25469

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.6868333475552522 |
---------------------------------------------------
Training F1-score: 0.8550336693295412

Test F1-score: 0.6868333475552522

Conclusions:

  • All the validation methods give results similar to the test evaluation. The "hardest" validation seems to be the one that validates only on previously unseen customers. K-fold validation takes the longest, as expected. The time-split validation most closely mirrors how the test evaluation is performed and also seems closest to the real scenario; it is also fast, so I will use it as the main validation method. When possible and reasonable, I will also report the "unseen customers" metrics (random split by customer), because I think they are informative.
  • The fact that the "Random" validations yield results similar to the "time-split" validation could be seen as an informal indicator of stationarity (at least over the one-month period under consideration).

4 Missing Data

Top

Two strategies were considered to deal with the missing data:

  • Fill the missing data with medians and most frequent values (class BasicImputer).
  • Fill the missing data using an ML estimator to predict the missing features from the non-missing ones (class EstimatorImputer).

A function to show some of the results of the imputers was also implemented.
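A minimal sketch of the first strategy follows (the class name mirrors `md.BasicImputer`, but the column lists are assumptions; in the notebook the encoder runs before the imputer, so gender is already numeric there, while the sketch fills the raw strings for readability):

```python
import pandas as pd

class BasicImputerSketch:
    """Illustrative median / most-frequent filling of the demographics."""

    def fit(self, df: pd.DataFrame):
        # Numeric columns get the median, gender gets the most frequent value
        self.medians_ = df[['age', 'income']].median()
        self.gender_mode_ = df['gender'].mode().iloc[0]
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        df[['age', 'income']] = df[['age', 'income']].fillna(self.medians_)
        df['gender'] = df['gender'].fillna(self.gender_mode_)
        return df
```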

In [74]:
%reset -f
In [75]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import datetime as dt
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import accuracy_score

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.visualization.visualize as vis
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

BasicImputer

In [76]:
print(md.BasicImputer.__doc__)
    Fills the demographics missing data with medians and most frequent values.
    Args:
        fill_mode(list(str)): The names of the columns (besides gender) whose
            missing data is filled with the most frequent value. This is used
            if new features with missing data are added to the dataset.
    
In [77]:
data = pd.read_pickle(os.path.join(DATA_INTERIM, 'static_data.pkl'))
print(data.shape)
data.head()
(76277, 28)
Out[77]:
person time offer_id age became_member_on gender income missing_demographics member_epoch_days difficulty ... expected_finish finish success view_time viewed actual_reward profit_in_duration profit_until_complete spent_in_duration spent_until_complete
0 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... 240.0 240.0 0 192.0 1 0.0 22.16 22.16 22.16 22.16
1 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... 432.0 432.0 0 372.0 1 0.0 8.57 8.57 8.57 8.57
2 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 33.0 2017-04-21 M 72000.0 0 17277 5.0 ... 528.0 414.0 0 NaN 0 5.0 17.68 3.57 22.68 8.57
3 0009655768c64bdeb2e877511632db8f 504 fafdcd668e3743c1bb461111dcafc2a4 33.0 2017-04-21 M 72000.0 0 17277 10.0 ... 744.0 528.0 0 NaN 0 2.0 94.87 12.11 96.87 14.11
4 0009655768c64bdeb2e877511632db8f 576 2906b810c7d4411798c6938adc9daaa5 33.0 2017-04-21 M 72000.0 0 17277 10.0 ... 744.0 576.0 0 NaN 0 2.0 67.20 8.27 69.20 10.27

5 rows × 28 columns

In [78]:
data.isnull().mean().sort_values(ascending=False)
Out[78]:
view_time                0.249748
age                      0.128164
gender                   0.128164
income                   0.128164
spent_until_complete     0.000000
reward_t                 0.000000
time                     0.000000
offer_id                 0.000000
became_member_on         0.000000
missing_demographics     0.000000
member_epoch_days        0.000000
difficulty               0.000000
duration                 0.000000
offer_type               0.000000
channel_mobile           0.000000
spent_in_duration        0.000000
channel_email            0.000000
channel_web              0.000000
channel_social           0.000000
completed                0.000000
expected_finish          0.000000
finish                   0.000000
success                  0.000000
viewed                   0.000000
actual_reward            0.000000
profit_in_duration       0.000000
profit_until_complete    0.000000
person                   0.000000
dtype: float64
In [79]:
data[data.viewed == 1].view_time.isnull().sum()
Out[79]:
0

The "view_time" column has missing values only in the places where the customer didn't see the offer, which is reasonable. There is nothing to fill there. The values to fill are those from the demographics of the customer: age, gender, income.

In [80]:
# Create a pipeline (it is necessary to encode before filling the missing data)
encoder_imputer = Pipeline([
    ('encoder', pp.BasicEncoder()),
    ('imputer', md.BasicImputer())
])

# Encode and fill with medians and most frequent values.
%time filled = encoder_imputer.fit_transform(data)

print(filled.isnull().mean().sort_values(ascending=False))
filled.head()
CPU times: user 311 ms, sys: 12 ms, total: 323 ms
Wall time: 317 ms
view_time                0.249748
spent_until_complete     0.000000
reward_t                 0.000000
time                     0.000000
offer_id                 0.000000
age                      0.000000
became_member_on         0.000000
gender                   0.000000
income                   0.000000
missing_demographics     0.000000
member_epoch_days        0.000000
difficulty               0.000000
duration                 0.000000
offer_type               0.000000
channel_mobile           0.000000
spent_in_duration        0.000000
channel_email            0.000000
channel_web              0.000000
channel_social           0.000000
completed                0.000000
expected_finish          0.000000
finish                   0.000000
success                  0.000000
viewed                   0.000000
actual_reward            0.000000
profit_in_duration       0.000000
profit_until_complete    0.000000
person                   0.000000
dtype: float64
Out[80]:
person time offer_id age became_member_on gender income missing_demographics member_epoch_days difficulty ... expected_finish finish success view_time viewed actual_reward profit_in_duration profit_until_complete spent_in_duration spent_until_complete
0 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 33.0 2017-04-21 1.0 72000.0 0 17277 0.0 ... 240.0 240.0 0 192.0 1 0.0 22.16 22.16 22.16 22.16
1 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 33.0 2017-04-21 1.0 72000.0 0 17277 0.0 ... 432.0 432.0 0 372.0 1 0.0 8.57 8.57 8.57 8.57
2 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 33.0 2017-04-21 1.0 72000.0 0 17277 5.0 ... 528.0 414.0 0 NaN 0 5.0 17.68 3.57 22.68 8.57
3 0009655768c64bdeb2e877511632db8f 504 fafdcd668e3743c1bb461111dcafc2a4 33.0 2017-04-21 1.0 72000.0 0 17277 10.0 ... 744.0 528.0 0 NaN 0 2.0 94.87 12.11 96.87 14.11
4 0009655768c64bdeb2e877511632db8f 576 2906b810c7d4411798c6938adc9daaa5 33.0 2017-04-21 1.0 72000.0 0 17277 10.0 ... 744.0 576.0 0 NaN 0 2.0 67.20 8.27 69.20 10.27

5 rows × 28 columns

In [81]:
vis.show_imputer_results(data, filled)

It can be seen that filling with medians and most frequent values clearly distorts the distributions of the imputed features.

Filling by fitting an estimator

All the (relevant) missing data is in the profile dataframe. The problem is that when a customer is missing one demographic feature, they are usually missing almost all of them. That makes the missing values hard to estimate but, in the worst case, the imputed data will at least follow the distribution of the rest of the data.
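The internals of md.EstimatorImputer are project code and not shown here, but the idea can be sketched as follows. The impute_with_estimator helper and the toy data below are illustrative assumptions, not the project's actual implementation: train an estimator on the rows where a demographic feature is known, then predict it for the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_estimator(df, target_col, feature_cols):
    """Fill NaNs in `target_col` by predicting them from `feature_cols`."""
    known = df[df[target_col].notnull()]
    unknown = df[df[target_col].isnull()]
    if unknown.empty:
        return df
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(known[feature_cols], known[target_col])
    filled = df.copy()
    # Overwrite only the missing entries with the model's predictions
    filled.loc[unknown.index, target_col] = model.predict(unknown[feature_cols])
    return filled

# Toy example: income is missing for the last two rows
toy = pd.DataFrame({
    'age': [25, 40, 60, 35, 50],
    'member_epoch_days': [17000, 17100, 17200, 17300, 17400],
    'income': [30000.0, 50000.0, 80000.0, np.nan, np.nan],
})
filled = impute_with_estimator(toy, 'income', ['age', 'member_epoch_days'])
print(filled.income.isnull().sum())  # 0
```

A categorical target like gender would use a classifier instead of a regressor, but the pattern is the same.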

In [82]:
print(md.EstimatorImputer.__doc__)
    Fills the demographics missing data with predictions from an estimator.
    
In [83]:
# Show the initial status
data = pd.read_pickle(os.path.join(DATA_INTERIM, 'static_data.pkl'))
data.isnull().mean().sort_values(ascending=False)
Out[83]:
view_time                0.249748
age                      0.128164
gender                   0.128164
income                   0.128164
spent_until_complete     0.000000
reward_t                 0.000000
time                     0.000000
offer_id                 0.000000
became_member_on         0.000000
missing_demographics     0.000000
member_epoch_days        0.000000
difficulty               0.000000
duration                 0.000000
offer_type               0.000000
channel_mobile           0.000000
spent_in_duration        0.000000
channel_email            0.000000
channel_web              0.000000
channel_social           0.000000
completed                0.000000
expected_finish          0.000000
finish                   0.000000
success                  0.000000
viewed                   0.000000
actual_reward            0.000000
profit_in_duration       0.000000
profit_until_complete    0.000000
person                   0.000000
dtype: float64
In [84]:
# Create a pipeline (it is necessary to encode before filling the missing data)
encoder_imputer = Pipeline([
    ('encoder', pp.BasicEncoder()),
    ('imputer', md.EstimatorImputer())
])

# Encode and fill the missing values with an estimator's predictions
%time filled = encoder_imputer.fit_transform(data)

print(filled.isnull().mean().sort_values(ascending=False))
filled.head()
CPU times: user 33.6 s, sys: 40.8 ms, total: 33.6 s
Wall time: 33.6 s
view_time                0.249748
member_month             0.000000
channel_email            0.000000
time                     0.000000
offer_id                 0.000000
age                      0.000000
became_member_on         0.000000
gender                   0.000000
income                   0.000000
missing_demographics     0.000000
member_epoch_days        0.000000
difficulty               0.000000
duration                 0.000000
offer_type               0.000000
reward_t                 0.000000
channel_mobile           0.000000
channel_web              0.000000
member_year              0.000000
channel_social           0.000000
completed                0.000000
expected_finish          0.000000
finish                   0.000000
success                  0.000000
viewed                   0.000000
actual_reward            0.000000
profit_in_duration       0.000000
profit_until_complete    0.000000
spent_in_duration        0.000000
spent_until_complete     0.000000
member_day               0.000000
member_weekday           0.000000
person                   0.000000
dtype: float64
Out[84]:
person time offer_id age became_member_on gender income missing_demographics member_epoch_days difficulty ... viewed actual_reward profit_in_duration profit_until_complete spent_in_duration spent_until_complete member_day member_weekday member_year member_month
0 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 33.0 2017-04-21 1.0 72000.0 0 17277 0.0 ... 1 0.0 22.16 22.16 22.16 22.16 21 4 2017 4
1 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 33.0 2017-04-21 1.0 72000.0 0 17277 0.0 ... 1 0.0 8.57 8.57 8.57 8.57 21 4 2017 4
2 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 33.0 2017-04-21 1.0 72000.0 0 17277 5.0 ... 0 5.0 17.68 3.57 22.68 8.57 21 4 2017 4
3 0009655768c64bdeb2e877511632db8f 504 fafdcd668e3743c1bb461111dcafc2a4 33.0 2017-04-21 1.0 72000.0 0 17277 10.0 ... 0 2.0 94.87 12.11 96.87 14.11 21 4 2017 4
4 0009655768c64bdeb2e877511632db8f 576 2906b810c7d4411798c6938adc9daaa5 33.0 2017-04-21 1.0 72000.0 0 17277 10.0 ... 0 2.0 67.20 8.27 69.20 10.27 21 4 2017 4

5 rows × 32 columns

In [85]:
vis.show_imputer_results(data, filled)

That looks better than the BasicImputer, although it takes much longer to run. A metric like RMSE or accuracy can tell whether there is a real improvement.

Missing data imputers evaluation

In [86]:
estimator_imputer = Pipeline([
    ('encoder', pp.BasicEncoder()),
    ('imputer', md.EstimatorImputer())
])

basic_imputer = Pipeline([
    ('encoder', pp.BasicEncoder()),
    ('imputer', md.BasicImputer())
])

Let's generate a clean dataset and then artificially blank out some ages, incomes, and genders.

In [87]:
# Get a dataset without missing demographics
clean_data = data[data.missing_demographics == 0]

# Separate targets
target_feats = ['age', 'income', 'gender']
X = clean_data.drop(target_feats, axis=1)
y = clean_data[target_feats]

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=2018)

# Re-join the targets to the training set only; the test rows keep NaNs in the target columns
data_train = X_train.join(y_train)
X_input = data_train.append(X_test, sort=False).sort_values(by='time')

# Show the results
print(X_input.isnull().mean().sort_values(ascending=False))
X_input.head()
gender                   0.300011
age                      0.300011
income                   0.300011
view_time                0.264763
channel_email            0.000000
time                     0.000000
offer_id                 0.000000
became_member_on         0.000000
missing_demographics     0.000000
member_epoch_days        0.000000
difficulty               0.000000
duration                 0.000000
offer_type               0.000000
reward_t                 0.000000
channel_mobile           0.000000
channel_social           0.000000
channel_web              0.000000
completed                0.000000
expected_finish          0.000000
finish                   0.000000
success                  0.000000
viewed                   0.000000
actual_reward            0.000000
profit_in_duration       0.000000
profit_until_complete    0.000000
spent_in_duration        0.000000
spent_until_complete     0.000000
person                   0.000000
dtype: float64
Out[87]:
person time offer_id became_member_on missing_demographics member_epoch_days difficulty duration offer_type reward_t ... view_time viewed actual_reward profit_in_duration profit_until_complete spent_in_duration spent_until_complete age income gender
48822 a2506fdc02c544a98f491ac97fa62f88 0 fafdcd668e3743c1bb461111dcafc2a4 2018-05-10 0 17661 10.0 10.0 discount 2.0 ... 0.0 1 2.0 18.58 12.83 20.58 14.83 64.0 42000.0 M
18018 3cabd0099961416b9810043e7eb0f743 0 ae264e3637204a6fb9bb56bc8210ddfd 2015-12-18 0 16787 10.0 7.0 bogo 10.0 ... 42.0 1 10.0 75.03 21.11 85.03 31.11 43.0 52000.0 M
578 020d72e77c704c42bb0a858c1aaa3bcd 0 3f207df678b143eea3cee63160fa8bed 2017-06-26 0 17343 0.0 4.0 informational 0.0 ... 36.0 1 0.0 40.87 40.87 40.87 40.87 68.0 57000.0 F
65149 da5b9a0ab0424d51a502f9423dc75c81 0 2906b810c7d4411798c6938adc9daaa5 2018-07-22 0 17734 10.0 7.0 discount 2.0 ... 84.0 1 0.0 5.15 5.15 5.15 5.15 23.0 32000.0 M
66253 ddd92ad1b92f4a79be14e1067731a87d 0 3f207df678b143eea3cee63160fa8bed 2017-10-08 0 17447 0.0 4.0 informational 0.0 ... 96.0 1 0.0 0.00 0.00 0.00 0.00 31.0 56000.0 M

5 rows × 28 columns

Let's test both imputers now

In [88]:
%time X_out_basic = basic_imputer.fit_transform(X_input)
CPU times: user 377 ms, sys: 15.9 ms, total: 393 ms
Wall time: 395 ms
In [89]:
%time X_out_estimator = estimator_imputer.fit_transform(X_input)
CPU times: user 26.2 s, sys: 44.5 ms, total: 26.2 s
Wall time: 26.3 s

Evaluation

In [90]:
# Name the true values of the missing data
age_true = y_test.age
income_true = y_test.income
gender_true = pp.gender_encode(y_test.loc[:, ['gender']])

# Evaluate the Basic Imputer
print('Basic Imputer\n')

age_basic_pred = X_out_basic.loc[y_test.index, :].age
income_basic_pred = X_out_basic.loc[y_test.index, :].income
gender_basic_pred = X_out_basic.loc[y_test.index, :].gender

print('Age RMSE: {}'.format(np.sqrt(mse(age_true, age_basic_pred))))
print('Income RMSE: {}'.format(np.sqrt(mse(income_true, income_basic_pred))))
print('Gender Accuracy: {}'.format(accuracy_score(gender_true,
                                                  gender_basic_pred)))
print('-'*100)

# Evaluate the Estimator Imputer
print('Estimator Imputer\n')

age_est_pred = X_out_estimator.loc[y_test.index, :].age
income_est_pred = X_out_estimator.loc[y_test.index, :].income
gender_est_pred = X_out_estimator.loc[y_test.index, :].gender

print('Age RMSE: {}'.format(np.sqrt(mse(age_true, age_est_pred))))
print('Income RMSE: {}'.format(np.sqrt(mse(income_true, income_est_pred))))
print('Gender Accuracy: {}'.format(accuracy_score(gender_true,
                                                  gender_est_pred)))
print('-'*100)
Basic Imputer

Age RMSE: 17.358943390751815
Income RMSE: 21749.830690857074
Gender Accuracy: 0.5695453862
----------------------------------------------------------------------------------------------------
Estimator Imputer

Age RMSE: 16.64788070800291
Income RMSE: 20668.84273603623
Gender Accuracy: 0.6152072578
----------------------------------------------------------------------------------------------------

As expected, the Estimator Imputer gives better results for all three features, but it takes much longer to process the data.

5 Clustering

Top

Several clustering techniques were used to try to better understand the dataset, and to generate new features for the estimators. The clustering algorithms used were:

  • K-Means
  • Hierarchical Clustering (Ward)
  • Gaussian Mixture Models
  • DBSCAN
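
As a quick reference for the APIs behind the four algorithms, here is a minimal sketch on synthetic blobs (not the Starbucks profiles); scikit-learn's AgglomerativeClustering with linkage='ward' stands in for the SciPy ward/fcluster pipeline used later, and the eps and cluster counts below are arbitrary toy choices:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Synthetic 2-D stand-in for the profile data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=2018)

labels = {
    'K-Means': KMeans(n_clusters=3, random_state=2018).fit_predict(X),
    'Ward': AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X),
    'GMM': GaussianMixture(n_components=3, random_state=2018).fit_predict(X),
    'DBSCAN': DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
}
for name, lab in labels.items():
    n = len(set(lab) - {-1})  # DBSCAN labels noise points as -1
    print('{}: {} clusters'.format(name, n))
```

Note that K-Means, Ward, and GMM take the number of clusters as a parameter, while DBSCAN infers it from the density parameters (eps, min_samples).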

5.A 4D Clustering (gender, age, income, member_since)

Top

The first approach was to use all the profile data available (4 features) to perform the clustering.

In [340]:
%reset -f
In [341]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import datetime as dt
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, ward, fcluster
from time import time
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN
from mpl_toolkits.mplot3d import Axes3D
from sklearn.neighbors import KNeighborsClassifier

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

STATIC_DATASET_PATH = os.path.join(DATA_INTERIM, 'static_data.pkl')

import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.visualization.visualize as vis
import src.features.clustering as clust
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

5.A.a Clustering EDA

Top

In [330]:
# Get the data
X_train, X_test, y_train, y_test, encoder = sd.get_success_data(drop_time=False,
                                                                anon=False)
X_train.head()
Out[330]:
person time offer_id age became_member_on gender income missing_demographics member_epoch_days difficulty duration offer_type reward_t channel_mobile channel_email channel_web channel_social expected_finish
0 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 33.0 2017-04-21 M 72000.0 0 17277 0.0 3.0 informational 0.0 1.0 1.0 0.0 1.0 240.0
1 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 33.0 2017-04-21 M 72000.0 0 17277 0.0 4.0 informational 0.0 1.0 1.0 1.0 0.0 432.0
2 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 33.0 2017-04-21 M 72000.0 0 17277 5.0 5.0 bogo 5.0 1.0 1.0 1.0 1.0 528.0
5 00116118485d4dfda04fdbaba9a87b5c 168 f19421c1d4aa40978ebb69ca19b0e20d NaN 2018-04-25 None NaN 1 17646 5.0 5.0 bogo 5.0 1.0 1.0 1.0 1.0 288.0
7 0011e0d4e6b944f998e987f904e8c1e5 0 3f207df678b143eea3cee63160fa8bed 40.0 2018-01-09 O 57000.0 0 17540 0.0 4.0 informational 0.0 1.0 1.0 1.0 0.0 96.0

The interesting clustering is on the customer features, as there are very few distinct offers.

In [331]:
# Encode and filter relevant features
customer_feats = ['age', 'gender', 'income', 'missing_demographics', 
                  'member_epoch_days']
offer_feats = ['difficulty', 'duration', 'offer_type', 'reward_t', 
               'channel_web', 'channel_social', 'channel_mobile', 
               'channel_email']

X_train_t = encoder.fit_transform(X_train)
X_train_t = X_train_t[customer_feats]
X_test_t = encoder.transform(X_test)
X_test_t = X_test_t[customer_feats]

# Drop duplicates and missing data
X_train_t = X_train_t.dropna().drop_duplicates()
X_test_t = X_test_t.dropna().drop_duplicates()

# Keep a copy with the original demographics
X_train_o = pp.gender_decode(X_train_t.copy())
X_test_o = pp.gender_decode(X_test_t.copy())

# Drop the irrelevant column
X_train_t = X_train_t.drop('missing_demographics', axis=1)
X_test_t = X_test_t.drop('missing_demographics', axis=1)

# Normalize
scaler = StandardScaler()
scaler.fit(X_train_t)

X_train_t = pd.DataFrame(scaler.transform(X_train_t),
                         index=X_train_t.index,
                         columns=X_train_t.columns)
X_test_t = pd.DataFrame(scaler.transform(X_test_t),
                         index=X_test_t.index,
                         columns=X_test_t.columns)
/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/preprocessing/data.py:625: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/ipykernel_launcher.py:29: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/ipykernel_launcher.py:32: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

In [332]:
print(X_train_t.shape)
X_train_t.head()
(14748, 4)
Out[332]:
age gender income member_epoch_days
0 -1.230638 0.769453 0.305519 0.146978
7 -0.827994 2.699367 -0.388848 0.774266
12 0.264895 -1.160461 1.138758 -0.838078
17 -1.748323 -1.160461 -0.249974 -0.237027
21 -1.633281 -1.160461 0.351810 0.292471
In [333]:
print(X_test_t.shape)
X_test_t.head()
(13883, 4)
Out[333]:
age gender income member_epoch_days
3 -1.230638 0.769453 0.305519 0.146978
11 -0.827994 2.699367 -0.388848 0.774266
16 0.264895 -1.160461 1.138758 -0.838078
20 -1.748323 -1.160461 -0.249974 -0.237027
25 -1.633281 -1.160461 0.351810 0.292471

Let's get a general visualization of the dataset

In [334]:
sns.pairplot(X_train_t)
Out[334]:
<seaborn.axisgrid.PairGrid at 0x7fc0168414e0>

There are some "artificial" stair-like shapes in the income-age plot, and also in the income-member_epoch_days plot. The thresholds could define clusters...

Let's use PCA to visualize the dataset in 2D

In [335]:
vis.pca_visualize(X_train_t)
Explained variance ratio for the first two components: 0.6148986186536066

The 2-D representation looks suitable for GMM or DBSCAN. In the next sections, some clustering algorithms will be tested.

5.A.b Validation Indexes

Top

For model comparison and validation, I will use the Silhouette score and DBCV (Density-Based Clustering Validation). As the main objective of this section is to obtain new features, I will keep the cluster labels of the methods that yield the best score on each metric. I could also keep the best run of each algorithm.

DBCV was implemented here: https://github.com/christopherjenness/DBCV

Let's try that library.

In [336]:
# import DBCV
from scipy.spatial.distance import euclidean

kmeans = KMeans(n_clusters=4)
kmeans.fit(X_train_t)
cluster_labels = kmeans.predict(X_train_t)
In [337]:
# %time DBCV.DBCV(X_train_t.values, cluster_labels, euclidean)

Running that function for validation takes too long. I will only use the Silhouette score (plus AIC in the case of GMM), the elbow method (when possible), and some visualizations with PCA.
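A minimal sketch of this validation scheme on synthetic data (the project's clust.validate_clustering helper is not reproduced here): for each candidate k, fit K-Means and record both the inertia, for the elbow method, and the Silhouette score.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the profile data
X, _ = make_blobs(n_samples=400, centers=5, cluster_std=0.8, random_state=2018)

inertias, silhouettes = {}, {}
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=2018).fit(X)
    inertias[k] = km.inertia_  # elbow method: look for where the drop flattens
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)
print('Best k by Silhouette score:', best_k)
```

The inertia always decreases as k grows, so the elbow is read off visually; the Silhouette score, by contrast, peaks at a specific k and can be maximized directly.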

5.A.c K-Means

Top

In [338]:
# A quick visualization of K-means clustering with 4 clusters
kmeans = KMeans(n_clusters=4)
%time kmeans.fit(X_train_t)
cluster = kmeans.predict(X_train_t)

vis.pca_visualize_clusters(X_train_t, cluster)
CPU times: user 299 ms, sys: 364 µs, total: 300 ms
Wall time: 299 ms
Explained variance ratio for the first two components: 0.7382034606332015
Explained variance ratio for the first two components: 0.5955638756544015
Explained variance ratio for the first two components: 0.752673245167816
Explained variance ratio for the first two components: 0.7046903139222775

The visualization is not too informative. Let's use the Silhouette score to determine the best number of clusters.

In [339]:
clusters = [{'n_clusters': n} for n in range(2, 30)]
silhouette, error, best_n = clust.validate_clustering(X_train_t, KMeans, clusters, clust.kmeans_error,
                                               'n_clusters')
Algorithm 1 of 28 finished in 4.0025999546051025 seconds.
Algorithm 2 of 28 finished in 4.099401235580444 seconds.
Algorithm 3 of 28 finished in 3.9410243034362793 seconds.
Algorithm 4 of 28 finished in 4.049719333648682 seconds.
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-339-c7b2fd090d3f> in <module>
      1 clusters = [{'n_clusters': n} for n in range(2, 30)]
      2 silhouette, error, best_n = clust.validate_clustering(X_train_t, KMeans, clusters, clust.kmeans_error,
----> 3                                                'n_clusters')

~/github_repos/starbucks-advertising/src/features/clustering.py in validate_clustering(X, clustering_algo, params, index_fun, n_clust_name)
     46         tic = time()
     47         method = clustering_algo(**param_set)
---> 48         labels = method.fit_predict(X)
     49         try:
     50             silhouette.append(silhouette_score(X, labels))

~/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in fit_predict(self, X, y, sample_weight)
    995             Index of the cluster each sample belongs to.
    996         """
--> 997         return self.fit(X, sample_weight=sample_weight).labels_
    998 
    999     def fit_transform(self, X, y=None, sample_weight=None):

~/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y, sample_weight)
    969                 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
    970                 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 971                 return_n_iter=True)
    972         return self
    973 

~/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in k_means(X, n_clusters, sample_weight, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter)
    378                 verbose=verbose, precompute_distances=precompute_distances,
    379                 tol=tol, x_squared_norms=x_squared_norms,
--> 380                 random_state=random_state)
    381             # determine if these results are the best so far
    382             if best_inertia is None or inertia < best_inertia:

~/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/cluster/k_means_.py in _kmeans_single_elkan(X, sample_weight, n_clusters, max_iter, init, verbose, x_squared_norms, random_state, tol, precompute_distances)
    442     centers, labels, n_iter = k_means_elkan(X, checked_sample_weight,
    443                                             n_clusters, centers, tol=tol,
--> 444                                             max_iter=max_iter, verbose=verbose)
    445     if sample_weight is None:
    446         inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)

sklearn/cluster/_k_means_elkan.pyx in sklearn.cluster._k_means_elkan.k_means_elkan()

~/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
    162 
    163 # Pairwise distances
--> 164 def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False,
    165                         X_norm_squared=None):
    166     """

KeyboardInterrupt: 

The validation loop takes too long, so it was interrupted. Let's just use K-Means with n = 8.

In [ ]:
n_clusters = 8
kmeans = KMeans(n_clusters=n_clusters, random_state=2018)
kmeans.fit(X_train_t)
cluster_labels = kmeans.predict(X_train_t)

# 2D PCA visualization
vis.pca_visualize_clusters(X_train_t, cluster_labels)

# Heatmap of the cluster centers' features
plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

# The same as the heatmap, but with a barplot
plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

# Number of customers per cluster
plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')

Description of the clusters:

  • 0) Low income young men. They joined the app more recently than the mean.
  • 1) Older low income men that joined the app recently.
  • 2) High income women. Noticeably older than the mean.
  • 3) Older women that joined recently. Slightly lower income than the mean.
  • 4) Early-adopter women. Slightly higher income and age than the mean.
  • 5) High income men. Noticeably older than the mean.
  • 6) Low income young women. They joined the app more recently than the mean.
  • 7) Early-adopter men. Slightly lower income and age than the mean.

    Reorganized (this shows that for every "male" cluster there is a very similar "female" cluster, with minor differences):

  • 4) Early-adopter women. Slightly higher income and age than the mean.
  • 7) Early-adopter men. Slightly lower income and age than the mean.
  • 0) Low income young men. They joined the app more recently than the mean.
  • 6) Low income young women. They joined the app more recently than the mean.
  • 2) High income women. Noticeably older than the mean.
  • 5) High income men. Noticeably older than the mean.
  • 1) Older low income men that joined the app recently.
  • 3) Older women that joined recently. Slightly lower income than the mean.

In every male/female cluster pair, the female customers seem to have a slightly higher income than the male customers. Customers with other gender choices don't seem numerous enough to influence the clusters in a noticeable way.

5.A.d Hierarchical (Ward)

Top

Let's look at a dendrogram for "Ward" hierarchical clustering.

In [ ]:
p = 50

%time linkage_matrix = ward(X_train_t)
%time ddg = dendrogram(linkage_matrix, truncate_mode='lastp', p=p)
plt.hlines(40, 0, 10 * p, colors='y')

With that limit, maybe 12 clusters would be reasonable. Let's check that with the Silhouette score.

In [ ]:
%time sns.clustermap(X_train_t, figsize=(10, 20), method='ward', cmap='viridis')

The most noticeable division seems to be by gender. The other features also seem to cluster the data. For example, "age" seems to divide the "male" cluster pretty well.

Let's get the best 'cutting distance' between clusters, according to the silhouette score.

In [ ]:
from collections import defaultdict

# cut_dists = np.arange(10, 150, 10)
cut_dists = np.exp(np.linspace(0, 5.5, 50))
results = defaultdict(dict)

# Cluster and gather results data
for i, max_d in enumerate(cut_dists):
    tic = time()
    clusters = fcluster(linkage_matrix, max_d, criterion='distance')
    n_clusters = len(np.unique(clusters))
    
    results['n_clusters'][max_d] = n_clusters
    if n_clusters >= 2:
        results['score'][max_d] = silhouette_score(X_train_t, clusters)
    else:
        results['score'][max_d] = 0
    
    toc = time()
    print('Algorithm {} of {} finished in {} seconds.'.format(
        i + 1, len(cut_dists), (toc - tic)))

    
# Show the results
best_d, best_score = max(results['score'].items(), key= lambda x: x[1])
best_n_clust = results['n_clusters'][best_d]
print(
    'The best Silhouette score is for {} clusters (maximum distance '.format(
    best_n_clust) + '= {}), and its value is: {}'.format(best_d, best_score))

# Plot silhouette score vs distance
m_dists = results['score'].keys()
scores = results['score'].values()
plt.plot(m_dists, scores)
plt.title('Silhouette score')
plt.vlines(best_d, min(scores), max(scores), 'r')
plt.xlabel('Maximum distance')

# Plot silhouette score vs n_clusters
plt.figure()
n, score_n = zip(*((results['n_clusters'][d], results['score'][d]) 
                   for d in results['n_clusters'].keys()))
plt.plot(n, score_n)
plt.title('Silhouette score')
plt.vlines(best_n_clust, min(score_n), max(score_n), 'r')
plt.xlabel('N clusters')

# Plot n_clusters vs distance
plt.figure()
m_dists = results['n_clusters'].keys()
n = results['n_clusters'].values()
plt.plot(m_dists, n)
plt.title('N clusters')
plt.xlabel('Max distance')
plt.ylabel('N')
plt.vlines(best_d, 0, max(n), 'r')

Zooming into the area of interest...

In [ ]:
# Plot silhouette score vs n_clusters
n_max = 20
n, score_n = zip(*((results['n_clusters'][d], results['score'][d]) 
                   for d in results['n_clusters'].keys()))
n = np.array(n)
score_n = np.array(score_n)
plt.plot(n[n<n_max], score_n[n<n_max])
plt.title('Silhouette score')
plt.vlines(best_n_clust, min(score_n[n<n_max]), max(score_n[n<n_max]), 'r')
plt.xlabel('N clusters')

It can be seen that beyond about 12 clusters the silhouette score decreases noticeably. That agrees with the initial intuition from the dendrogram. As 12 clusters may be more informative than 2 (which probably just split by gender), 12 may be the chosen number of clusters. Let's visualize both cases.

First, let's create a dataframe with the results

In [ ]:
res_df = pd.DataFrame(results)
res_df.index.name = 'distance'
res_df = res_df.reset_index()
res_df.head()

n_clusters = 2

In [ ]:
max_d = best_d
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')

That is just dividing by gender. Let's try with 12 clusters.

n_clusters = 12

In [ ]:
dist_12 = res_df[res_df.n_clusters == 12].distance.values[0]
print(dist_12)

max_d = dist_12
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')

What about 9?

In [ ]:
dist_9 = res_df[res_df.n_clusters == 9].distance.values[0]
print(dist_9)

max_d = dist_9
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')

The 9-cluster case looks very similar to the one found by K-Means, with the difference that Ward seems to have found a separate cluster for the people who chose "Other" as gender. We should check whether that is the case. I will keep both cases (9 and 12 clusters) to generate new features.

5.A.e GMM

Top

For Gaussian Mixture Models, besides the Silhouette score, I will also record the Akaike Information Criterion (AIC). The function that computes it is gmm_aic (docstring below).
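scikit-learn's GaussianMixture exposes the AIC directly through its aic method; the project's clust.gmm_aic helper (not shown here) presumably wraps something like the following sketch, shown on synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the profile data
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=2018)

# AIC trades off log-likelihood against the number of parameters; lower is better
aics = {n: GaussianMixture(n_components=n, random_state=2018).fit(X).aic(X)
        for n in range(2, 6)}
best_n = min(aics, key=aics.get)
print('Best n_components by AIC:', best_n)
```

Unlike the Silhouette score, the AIC is computed from the fitted model's likelihood, so it is specific to probabilistic models such as GMM.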

In [ ]:
print(clust.gmm_aic.__doc__)
In [301]:
clusters = [{'n_components': n} for n in range(2, 30)]
silhouette, aic, best_n = clust.validate_clustering(X_train_t, GaussianMixture, clusters, 
                                                    clust.gmm_aic, 'n_components')
print('The best case, based on the Silhouette score, is for {}'.format(best_n))
Algorithm 1 of 28 finished in 4.899714946746826 seconds.
Algorithm 2 of 28 finished in 3.7599172592163086 seconds.
Algorithm 3 of 28 finished in 3.7738938331604004 seconds.
Algorithm 4 of 28 finished in 3.8976151943206787 seconds.
Algorithm 5 of 28 finished in 4.060792922973633 seconds.
Algorithm 6 of 28 finished in 4.498376846313477 seconds.
Algorithm 7 of 28 finished in 4.455791711807251 seconds.
Algorithm 8 of 28 finished in 4.125568866729736 seconds.
Algorithm 9 of 28 finished in 4.208611249923706 seconds.
Algorithm 10 of 28 finished in 4.384902238845825 seconds.
Algorithm 11 of 28 finished in 4.536301136016846 seconds.
Algorithm 12 of 28 finished in 5.318991661071777 seconds.
Algorithm 13 of 28 finished in 4.743462562561035 seconds.
Algorithm 14 of 28 finished in 4.618605852127075 seconds.
Algorithm 15 of 28 finished in 4.895904064178467 seconds.
Algorithm 16 of 28 finished in 5.089925527572632 seconds.
Algorithm 17 of 28 finished in 5.110774755477905 seconds.
Algorithm 18 of 28 finished in 4.761854648590088 seconds.
Algorithm 19 of 28 finished in 5.329925537109375 seconds.
Algorithm 20 of 28 finished in 5.211430549621582 seconds.
Algorithm 21 of 28 finished in 5.093765020370483 seconds.
Algorithm 22 of 28 finished in 5.767065763473511 seconds.
Algorithm 23 of 28 finished in 6.405827760696411 seconds.
Algorithm 24 of 28 finished in 5.828770160675049 seconds.
Algorithm 25 of 28 finished in 5.253058910369873 seconds.
Algorithm 26 of 28 finished in 5.650788307189941 seconds.
Algorithm 27 of 28 finished in 5.989504337310791 seconds.
Algorithm 28 of 28 finished in 5.669895887374878 seconds.
The best Silhouette score is for {'n_components': 2}, and its value is: 0.2773163761270585
The error for {'n_components': 2} is: 73527.0604425528
The best case, based on the Silhouette score, is for {'n_components': 2}
In [302]:
pd.DataFrame([clusters, aic, silhouette], index=['params', 'aic', 'silhouette']).T
Out[302]:
params aic silhouette
0 {'n_components': 2} 73527.1 0.277316
1 {'n_components': 3} -31770.1 0.261561
2 {'n_components': 4} -31740.1 0.261561
3 {'n_components': 5} -37327.5 0.255285
4 {'n_components': 6} -37477.8 0.210525
5 {'n_components': 7} -39364 0.214046
6 {'n_components': 8} -39415.8 0.161979
7 {'n_components': 9} -39695.3 0.200913
8 {'n_components': 10} -39947.9 0.192034
9 {'n_components': 11} -40006.4 0.198259
10 {'n_components': 12} -40150.6 0.117566
11 {'n_components': 13} -40394.5 0.208286
12 {'n_components': 14} -40269.1 0.176484
13 {'n_components': 15} -40475.6 0.2042
14 {'n_components': 16} -40682.2 0.195111
15 {'n_components': 17} -40913.3 0.206614
16 {'n_components': 18} -41474.5 0.186892
17 {'n_components': 19} -40640.1 0.198801
18 {'n_components': 20} -41230.2 0.194763
19 {'n_components': 21} -41066.6 0.195005
20 {'n_components': 22} -41384.4 0.189975
21 {'n_components': 23} -41542.8 0.181444
22 {'n_components': 24} -41401.4 0.188633
23 {'n_components': 25} -41529.2 0.186149
24 {'n_components': 26} -41597.8 0.200002
25 {'n_components': 27} -41822.9 0.202385
26 {'n_components': 28} -41646.9 0.181227
27 {'n_components': 29} -41651.8 0.188057

We should check whether the 2-cluster case is just splitting the dataset by gender. I would also like to look at the 4-cluster case.

2 clusters

In [303]:
gmm = GaussianMixture(n_components=2)
gmm.fit(X_train_t)
cluster_labels = gmm.predict(X_train_t)

vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.7548832827480705
Explained variance ratio for the first two components: 0.7517565478399462
Out[303]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>

How would this look if we just divided by gender?

In [304]:
cluster_labels = (X_train.gender == 'F').astype(int)

vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.7548832827480709
Explained variance ratio for the first two components: 0.7517565478399465
/home/miguel/github_repos/starbucks-advertising/src/visualization/visualize.py:82: UserWarning:

Boolean Series key will be reindexed to match DataFrame index.

/home/miguel/github_repos/starbucks-advertising/src/visualization/visualize.py:82: UserWarning:

Boolean Series key will be reindexed to match DataFrame index.

Out[304]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>

As expected, the 2-cluster case is probably just dividing the dataset by gender. That is not an interesting division (it won't give any new information).
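One way to go beyond eyeballing the plots is to compare the cluster labels with the gender split numerically, for example with the adjusted Rand index. Below is a minimal sketch on synthetic stand-ins (the real comparison would use `cluster_labels` and `X_train.gender`; every name here is illustrative):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Stand-in for (X_train.gender == 'F').astype(int)
gender_is_f = rng.integers(0, 2, size=1000)

# Stand-in for GMM cluster labels: mostly equal to the gender split,
# with 5% of the points disagreeing.
cluster_labels = gender_is_f.copy()
flip = rng.random(1000) < 0.05
cluster_labels[flip] = 1 - cluster_labels[flip]

# ARI is 1.0 for identical partitions and ~0.0 for unrelated ones.
ari = adjusted_rand_score(gender_is_f, cluster_labels)
print(f'Adjusted Rand index vs. gender: {ari:.3f}')
```

An ARI close to 1 would confirm that the 2-cluster solution carries essentially no information beyond gender.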

4 clusters

In [305]:
gmm = GaussianMixture(n_components=4)
gmm.fit(X_train_t)
cluster_labels = gmm.predict(X_train_t)

vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.8825553935637607
Explained variance ratio for the first two components: 0.7509303356378052
Explained variance ratio for the first two components: 0.7239674699380194
Explained variance ratio for the first two components: 0.7779665053326823
Out[305]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>

The 4-cluster division also takes into account the people who answered "Other" to the gender question, which is a very small group in the complete dataset. It doesn't look like Gaussian Mixture Models will give much new information.
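The table above ranks the candidates by AIC; BIC penalizes extra components more strongly and often selects a smaller model, so it is a useful cross-check. A minimal sketch of the same selection loop on synthetic data (not the project's `X_train_t`; all names here are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(42)
# Two well-separated synthetic 3D blobs, 200 points each.
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(6, 1, (200, 3))])

# Fit GMMs for a range of component counts and record both criteria.
scores = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X)
    scores[k] = (gmm.aic(X), gmm.bic(X))

best_bic = min(scores, key=lambda k: scores[k][1])
print('Best number of components by BIC:', best_bic)
```

On well-separated data both criteria agree; when they diverge, BIC's stronger complexity penalty usually points at the more parsimonious model.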

5.A.f DBSCAN

Top

Let's grid search for DBSCAN parameters.

In [306]:
# Define DBSCAN parameters
n_batch = 20
m_batch = 5
eps_list = np.linspace(0.1, 2.0, n_batch).repeat(m_batch)
min_samples_list = [5, 20, 50, 100, 300] * n_batch
params = [{'eps': e, 'min_samples': ms} for e, ms in zip(eps_list, min_samples_list)]

# Search and gather the results
silhouette = list()
n_clusters = list()
for i, param_set in enumerate(params):
    tic = time()
    method = DBSCAN(**param_set)
    method.fit(X_train_t)
    labels = method.labels_
    try:
        silhouette.append(silhouette_score(X_train_t, labels))
    except ValueError:
        silhouette.append(0)
    n_clusters.append(len(np.unique(labels)))
    toc = time()
    print('Algorithm {} of {} finished in {} seconds. Params: {} Clusters: {}'.format(
        i + 1, len(params), (toc - tic), param_set, len(np.unique(labels))))
    
# Show the results
best_silhouette_params = params[np.argmax(silhouette)]
print('The best Silhouette score is for {}, and its value is: {}'.format(
    best_silhouette_params, max(silhouette)))
print('The number of clusters for {} is: {}'.format(
    best_silhouette_params, n_clusters[np.argmax(silhouette)]))
Algorithm 1 of 100 finished in 4.92281699180603 seconds. Params: {'eps': 0.1, 'min_samples': 5} Clusters: 278
Algorithm 2 of 100 finished in 0.24879765510559082 seconds. Params: {'eps': 0.1, 'min_samples': 20} Clusters: 1
Algorithm 3 of 100 finished in 0.22026896476745605 seconds. Params: {'eps': 0.1, 'min_samples': 50} Clusters: 1
Algorithm 4 of 100 finished in 0.2852349281311035 seconds. Params: {'eps': 0.1, 'min_samples': 100} Clusters: 1
Algorithm 5 of 100 finished in 0.2066514492034912 seconds. Params: {'eps': 0.1, 'min_samples': 300} Clusters: 1
Algorithm 6 of 100 finished in 4.488114833831787 seconds. Params: {'eps': 0.2, 'min_samples': 5} Clusters: 151
Algorithm 7 of 100 finished in 3.9703500270843506 seconds. Params: {'eps': 0.2, 'min_samples': 20} Clusters: 18
Algorithm 8 of 100 finished in 0.22869443893432617 seconds. Params: {'eps': 0.2, 'min_samples': 50} Clusters: 1
Algorithm 9 of 100 finished in 0.2562577724456787 seconds. Params: {'eps': 0.2, 'min_samples': 100} Clusters: 1
Algorithm 10 of 100 finished in 0.2619454860687256 seconds. Params: {'eps': 0.2, 'min_samples': 300} Clusters: 1
Algorithm 11 of 100 finished in 3.963482141494751 seconds. Params: {'eps': 0.3, 'min_samples': 5} Clusters: 29
Algorithm 12 of 100 finished in 3.8496475219726562 seconds. Params: {'eps': 0.3, 'min_samples': 20} Clusters: 10
Algorithm 13 of 100 finished in 3.9714646339416504 seconds. Params: {'eps': 0.3, 'min_samples': 50} Clusters: 3
Algorithm 14 of 100 finished in 4.210947036743164 seconds. Params: {'eps': 0.3, 'min_samples': 100} Clusters: 3
Algorithm 15 of 100 finished in 0.3154284954071045 seconds. Params: {'eps': 0.3, 'min_samples': 300} Clusters: 1
Algorithm 16 of 100 finished in 4.038808584213257 seconds. Params: {'eps': 0.4, 'min_samples': 5} Clusters: 19
Algorithm 17 of 100 finished in 4.011760473251343 seconds. Params: {'eps': 0.4, 'min_samples': 20} Clusters: 4
Algorithm 18 of 100 finished in 4.058872222900391 seconds. Params: {'eps': 0.4, 'min_samples': 50} Clusters: 3
Algorithm 19 of 100 finished in 4.182351589202881 seconds. Params: {'eps': 0.4, 'min_samples': 100} Clusters: 4
Algorithm 20 of 100 finished in 0.39670610427856445 seconds. Params: {'eps': 0.4, 'min_samples': 300} Clusters: 1
Algorithm 21 of 100 finished in 4.179344177246094 seconds. Params: {'eps': 0.5, 'min_samples': 5} Clusters: 10
Algorithm 22 of 100 finished in 4.125363111495972 seconds. Params: {'eps': 0.5, 'min_samples': 20} Clusters: 4
Algorithm 23 of 100 finished in 4.068257570266724 seconds. Params: {'eps': 0.5, 'min_samples': 50} Clusters: 3
Algorithm 24 of 100 finished in 4.017565011978149 seconds. Params: {'eps': 0.5, 'min_samples': 100} Clusters: 3
Algorithm 25 of 100 finished in 4.270282983779907 seconds. Params: {'eps': 0.5, 'min_samples': 300} Clusters: 2
Algorithm 26 of 100 finished in 4.25464129447937 seconds. Params: {'eps': 0.6, 'min_samples': 5} Clusters: 6
Algorithm 27 of 100 finished in 4.163466691970825 seconds. Params: {'eps': 0.6, 'min_samples': 20} Clusters: 4
Algorithm 28 of 100 finished in 4.137932777404785 seconds. Params: {'eps': 0.6, 'min_samples': 50} Clusters: 3
Algorithm 29 of 100 finished in 4.098619699478149 seconds. Params: {'eps': 0.6, 'min_samples': 100} Clusters: 3
Algorithm 30 of 100 finished in 4.098857641220093 seconds. Params: {'eps': 0.6, 'min_samples': 300} Clusters: 3
Algorithm 31 of 100 finished in 4.30359411239624 seconds. Params: {'eps': 0.7, 'min_samples': 5} Clusters: 6
Algorithm 32 of 100 finished in 4.2933948040008545 seconds. Params: {'eps': 0.7, 'min_samples': 20} Clusters: 4
Algorithm 33 of 100 finished in 4.374644756317139 seconds. Params: {'eps': 0.7, 'min_samples': 50} Clusters: 3
Algorithm 34 of 100 finished in 4.262623310089111 seconds. Params: {'eps': 0.7, 'min_samples': 100} Clusters: 3
Algorithm 35 of 100 finished in 4.168470621109009 seconds. Params: {'eps': 0.7, 'min_samples': 300} Clusters: 3
Algorithm 36 of 100 finished in 4.427581310272217 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 5} Clusters: 4
Algorithm 37 of 100 finished in 4.395532846450806 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 20} Clusters: 4
Algorithm 38 of 100 finished in 4.388932228088379 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 50} Clusters: 3
Algorithm 39 of 100 finished in 4.378281116485596 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 100} Clusters: 3
Algorithm 40 of 100 finished in 4.31186318397522 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 300} Clusters: 3
Algorithm 41 of 100 finished in 4.50065279006958 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 5} Clusters: 4
Algorithm 42 of 100 finished in 4.623832702636719 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 20} Clusters: 4
Algorithm 43 of 100 finished in 4.81090784072876 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 50} Clusters: 3
Algorithm 44 of 100 finished in 4.851483583450317 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 100} Clusters: 3
Algorithm 45 of 100 finished in 4.533484935760498 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 300} Clusters: 3
Algorithm 46 of 100 finished in 4.637465476989746 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 5} Clusters: 4
Algorithm 47 of 100 finished in 4.8197922706604 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 20} Clusters: 4
Algorithm 48 of 100 finished in 4.698916435241699 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 50} Clusters: 4
Algorithm 49 of 100 finished in 4.6876513957977295 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 100} Clusters: 3
Algorithm 50 of 100 finished in 4.655216932296753 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 300} Clusters: 3
Algorithm 51 of 100 finished in 4.92055082321167 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 5} Clusters: 4
Algorithm 52 of 100 finished in 4.814425468444824 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 20} Clusters: 4
Algorithm 53 of 100 finished in 4.826855421066284 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 50} Clusters: 4
Algorithm 54 of 100 finished in 4.83001708984375 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 100} Clusters: 3
Algorithm 55 of 100 finished in 4.831061124801636 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 300} Clusters: 3
Algorithm 56 of 100 finished in 4.950398921966553 seconds. Params: {'eps': 1.2, 'min_samples': 5} Clusters: 4
Algorithm 57 of 100 finished in 4.967822551727295 seconds. Params: {'eps': 1.2, 'min_samples': 20} Clusters: 4
Algorithm 58 of 100 finished in 4.939865827560425 seconds. Params: {'eps': 1.2, 'min_samples': 50} Clusters: 4
Algorithm 59 of 100 finished in 4.950360298156738 seconds. Params: {'eps': 1.2, 'min_samples': 100} Clusters: 3
Algorithm 60 of 100 finished in 5.057043075561523 seconds. Params: {'eps': 1.2, 'min_samples': 300} Clusters: 3
Algorithm 61 of 100 finished in 5.143296241760254 seconds. Params: {'eps': 1.3, 'min_samples': 5} Clusters: 4
Algorithm 62 of 100 finished in 5.099344730377197 seconds. Params: {'eps': 1.3, 'min_samples': 20} Clusters: 4
Algorithm 63 of 100 finished in 5.12005090713501 seconds. Params: {'eps': 1.3, 'min_samples': 50} Clusters: 4
Algorithm 64 of 100 finished in 5.104725122451782 seconds. Params: {'eps': 1.3, 'min_samples': 100} Clusters: 3
Algorithm 65 of 100 finished in 5.108429908752441 seconds. Params: {'eps': 1.3, 'min_samples': 300} Clusters: 3
Algorithm 66 of 100 finished in 5.2735373973846436 seconds. Params: {'eps': 1.4, 'min_samples': 5} Clusters: 3
Algorithm 67 of 100 finished in 5.242714881896973 seconds. Params: {'eps': 1.4, 'min_samples': 20} Clusters: 4
Algorithm 68 of 100 finished in 5.240509748458862 seconds. Params: {'eps': 1.4, 'min_samples': 50} Clusters: 4
Algorithm 69 of 100 finished in 5.264678716659546 seconds. Params: {'eps': 1.4, 'min_samples': 100} Clusters: 3
Algorithm 70 of 100 finished in 5.802436351776123 seconds. Params: {'eps': 1.4, 'min_samples': 300} Clusters: 3
Algorithm 71 of 100 finished in 6.203758955001831 seconds. Params: {'eps': 1.5, 'min_samples': 5} Clusters: 3
Algorithm 72 of 100 finished in 5.3563055992126465 seconds. Params: {'eps': 1.5, 'min_samples': 20} Clusters: 3
Algorithm 73 of 100 finished in 5.3370680809021 seconds. Params: {'eps': 1.5, 'min_samples': 50} Clusters: 4
Algorithm 74 of 100 finished in 5.3128502368927 seconds. Params: {'eps': 1.5, 'min_samples': 100} Clusters: 4
Algorithm 75 of 100 finished in 5.313564777374268 seconds. Params: {'eps': 1.5, 'min_samples': 300} Clusters: 3
Algorithm 76 of 100 finished in 5.469306468963623 seconds. Params: {'eps': 1.5999999999999999, 'min_samples': 5} Clusters: 3
Algorithm 77 of 100 finished in 5.580581903457642 seconds. Params: {'eps': 1.5999999999999999, 'min_samples': 20} Clusters: 3
Algorithm 78 of 100 finished in 5.454380989074707 seconds. Params: {'eps': 1.5999999999999999, 'min_samples': 50} Clusters: 4
Algorithm 79 of 100 finished in 5.453770875930786 seconds. Params: {'eps': 1.5999999999999999, 'min_samples': 100} Clusters: 4
Algorithm 80 of 100 finished in 5.455522060394287 seconds. Params: {'eps': 1.5999999999999999, 'min_samples': 300} Clusters: 3
Algorithm 81 of 100 finished in 5.5795207023620605 seconds. Params: {'eps': 1.7, 'min_samples': 5} Clusters: 3
Algorithm 82 of 100 finished in 5.5700953006744385 seconds. Params: {'eps': 1.7, 'min_samples': 20} Clusters: 3
Algorithm 83 of 100 finished in 6.872115850448608 seconds. Params: {'eps': 1.7, 'min_samples': 50} Clusters: 4
Algorithm 84 of 100 finished in 7.050173997879028 seconds. Params: {'eps': 1.7, 'min_samples': 100} Clusters: 4
Algorithm 85 of 100 finished in 6.948551893234253 seconds. Params: {'eps': 1.7, 'min_samples': 300} Clusters: 3
Algorithm 86 of 100 finished in 5.879750490188599 seconds. Params: {'eps': 1.8, 'min_samples': 5} Clusters: 3
Algorithm 87 of 100 finished in 5.7209696769714355 seconds. Params: {'eps': 1.8, 'min_samples': 20} Clusters: 3
Algorithm 88 of 100 finished in 6.6899120807647705 seconds. Params: {'eps': 1.8, 'min_samples': 50} Clusters: 3
Algorithm 89 of 100 finished in 8.762344598770142 seconds. Params: {'eps': 1.8, 'min_samples': 100} Clusters: 4
Algorithm 90 of 100 finished in 9.109920740127563 seconds. Params: {'eps': 1.8, 'min_samples': 300} Clusters: 3
Algorithm 91 of 100 finished in 8.255971193313599 seconds. Params: {'eps': 1.9, 'min_samples': 5} Clusters: 3
Algorithm 92 of 100 finished in 8.899447679519653 seconds. Params: {'eps': 1.9, 'min_samples': 20} Clusters: 3
Algorithm 93 of 100 finished in 8.45381474494934 seconds. Params: {'eps': 1.9, 'min_samples': 50} Clusters: 3
Algorithm 94 of 100 finished in 7.809722185134888 seconds. Params: {'eps': 1.9, 'min_samples': 100} Clusters: 4
Algorithm 95 of 100 finished in 6.43085241317749 seconds. Params: {'eps': 1.9, 'min_samples': 300} Clusters: 3
Algorithm 96 of 100 finished in 2.8422722816467285 seconds. Params: {'eps': 2.0, 'min_samples': 5} Clusters: 1
Algorithm 97 of 100 finished in 2.678576946258545 seconds. Params: {'eps': 2.0, 'min_samples': 20} Clusters: 1
Algorithm 98 of 100 finished in 2.6831469535827637 seconds. Params: {'eps': 2.0, 'min_samples': 50} Clusters: 1
Algorithm 99 of 100 finished in 2.9722986221313477 seconds. Params: {'eps': 2.0, 'min_samples': 100} Clusters: 1
Algorithm 100 of 100 finished in 3.031630277633667 seconds. Params: {'eps': 2.0, 'min_samples': 300} Clusters: 1
The best Silhouette score is for {'eps': 0.8999999999999999, 'min_samples': 50}, and its value is: 0.2709286512361466
The number of clusters for {'eps': 0.8999999999999999, 'min_samples': 50} is: 3
In [307]:
# 3D plot for the Silhouette score
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_trisurf(eps_list, min_samples_list, silhouette, linewidth=0.2, antialiased=True)
Out[307]:
<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x7fc0058ae048>
In [308]:
# 3D plot for the number of clusters
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_trisurf(eps_list, min_samples_list, n_clusters, linewidth=0.2, antialiased=True)
Out[308]:
<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x7fc01bb27b38>
In [309]:
results = pd.DataFrame([eps_list, min_samples_list, silhouette, n_clusters], 
                      index=['epsilon', 'min_samples', 'silhouette_score', 'n_clusters']).T
results.sort_values(by='silhouette_score', ascending=False)
Out[309]:
epsilon min_samples silhouette_score n_clusters
42 0.9 50.0 0.270929 3.0
64 1.3 300.0 0.268715 3.0
48 1.0 100.0 0.268510 3.0
59 1.2 300.0 0.267350 3.0
37 0.8 50.0 0.266011 3.0
53 1.1 100.0 0.265833 3.0
69 1.4 300.0 0.264174 3.0
43 0.9 100.0 0.264084 3.0
54 1.1 300.0 0.263121 3.0
32 0.7 50.0 0.261616 3.0
71 1.5 20.0 0.261561 3.0
70 1.5 5.0 0.261561 3.0
65 1.4 5.0 0.261561 3.0
58 1.2 100.0 0.261561 3.0
86 1.8 20.0 0.261561 3.0
63 1.3 100.0 0.261561 3.0
85 1.8 5.0 0.261561 3.0
68 1.4 100.0 0.261561 3.0
74 1.5 300.0 0.261561 3.0
90 1.9 5.0 0.261561 3.0
81 1.7 20.0 0.261561 3.0
80 1.7 5.0 0.261561 3.0
79 1.6 300.0 0.261561 3.0
94 1.9 300.0 0.261561 3.0
84 1.7 300.0 0.261561 3.0
75 1.6 5.0 0.261561 3.0
91 1.9 20.0 0.261561 3.0
92 1.9 50.0 0.261561 3.0
89 1.8 300.0 0.261561 3.0
87 1.8 50.0 0.261561 3.0
... ... ... ... ...
16 0.4 20.0 0.203455 4.0
21 0.5 20.0 0.197111 4.0
25 0.6 5.0 0.189156 6.0
30 0.7 5.0 0.186957 6.0
20 0.5 5.0 0.120458 10.0
29 0.6 300.0 0.115856 3.0
24 0.5 300.0 0.086803 2.0
12 0.3 50.0 0.032150 3.0
15 0.4 5.0 0.016198 19.0
95 2.0 5.0 0.000000 1.0
96 2.0 20.0 0.000000 1.0
97 2.0 50.0 0.000000 1.0
98 2.0 100.0 0.000000 1.0
99 2.0 300.0 0.000000 1.0
1 0.1 20.0 0.000000 1.0
19 0.4 300.0 0.000000 1.0
14 0.3 300.0 0.000000 1.0
9 0.2 300.0 0.000000 1.0
8 0.2 100.0 0.000000 1.0
7 0.2 50.0 0.000000 1.0
4 0.1 300.0 0.000000 1.0
3 0.1 100.0 0.000000 1.0
2 0.1 50.0 0.000000 1.0
18 0.4 100.0 -0.005402 4.0
10 0.3 5.0 -0.093571 29.0
11 0.3 20.0 -0.137059 10.0
13 0.3 100.0 -0.222862 3.0
6 0.2 20.0 -0.423905 18.0
5 0.2 5.0 -0.444531 151.0
0 0.1 5.0 -0.576810 278.0

100 rows × 4 columns

Let's try with the best values then.

In [310]:
# Create clusters
eps = 0.9
min_samples = 50
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.7709658535290097
Explained variance ratio for the first two components: 0.777857196298626
Explained variance ratio for the first two components: 0.7496109982227275
Out[310]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>

That seems to have divided the dataset by gender again (including the "Other" option). Let's try other cases, with more clusters.

In [311]:
results.sort_values(by=['n_clusters', 'silhouette_score'], ascending=False).head(15)
Out[311]:
epsilon min_samples silhouette_score n_clusters
0 0.1 5.0 -0.576810 278.0
5 0.2 5.0 -0.444531 151.0
10 0.3 5.0 -0.093571 29.0
15 0.4 5.0 0.016198 19.0
6 0.2 20.0 -0.423905 18.0
20 0.5 5.0 0.120458 10.0
11 0.3 20.0 -0.137059 10.0
25 0.6 5.0 0.189156 6.0
30 0.7 5.0 0.186957 6.0
40 0.9 5.0 0.249391 4.0
56 1.2 20.0 0.248446 4.0
72 1.5 50.0 0.248446 4.0
78 1.6 100.0 0.247502 4.0
35 0.8 5.0 0.247436 4.0
50 1.1 5.0 0.247109 4.0

Case: eps=0.9 min_samples=5

In [312]:
# Create clusters
eps = 0.9
min_samples = 5
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.9241203453014921
Explained variance ratio for the first two components: 0.7779665053326819
Explained variance ratio for the first two components: 0.7438620996467973
Explained variance ratio for the first two components: 0.7517565478399462
Out[312]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>

Case: eps=0.3 min_samples=20

In [313]:
# Create clusters
eps = 0.3
min_samples = 20
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.6337215358909762
Explained variance ratio for the first two components: 0.7790962010609499
Explained variance ratio for the first two components: 0.7931983065505086
Explained variance ratio for the first two components: 0.8310929411692506
Explained variance ratio for the first two components: 0.8807107058136503
Explained variance ratio for the first two components: 0.7926583658507833
Explained variance ratio for the first two components: 0.8341852973002307
Explained variance ratio for the first two components: 0.8027881252083999
Explained variance ratio for the first two components: 0.8758623758746237
Explained variance ratio for the first two components: 0.9000444873902806
Out[313]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>

That looks like an interesting clustering!

Case: eps=0.2 min_samples=20

In [314]:
# Create clusters
eps = 0.2
min_samples = 20
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.6119520384426764
Explained variance ratio for the first two components: 0.9472907127322784
Explained variance ratio for the first two components: 0.8864734961693621
Explained variance ratio for the first two components: 0.7772043889203906
Explained variance ratio for the first two components: 0.8058796647899945
Explained variance ratio for the first two components: 0.8274914093577561
Explained variance ratio for the first two components: 0.9118101364880007
Explained variance ratio for the first two components: 0.7692847239798049
Explained variance ratio for the first two components: 0.7961287773167911
Explained variance ratio for the first two components: 0.7749008088311715
Explained variance ratio for the first two components: 0.816492789445995
Explained variance ratio for the first two components: 0.8447141841723623
Explained variance ratio for the first two components: 0.8317737564411434
Explained variance ratio for the first two components: 0.7507158890257389
Explained variance ratio for the first two components: 0.7767779892430311
Explained variance ratio for the first two components: 0.9654954539804872
Explained variance ratio for the first two components: 0.8746028063542287
Explained variance ratio for the first two components: 0.8719816254305199
Out[314]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>

I will keep the case with eps = 0.3, min_samples = 20: 20 samples is a sensible minimum for a cluster to be statistically relevant, 0.3 is low enough, 10 clusters is a manageable number, and this case detected a third relevant cluster based on membership time instead of gender.
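Note that DBSCAN marks noise points with the label -1, so the cluster counts reported above include the noise group whenever it exists. A small synthetic sketch of reading DBSCAN output correctly (the blobs and parameters here are illustrative, not the project data):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Three compact synthetic blobs, scaled like the project features.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=20).fit(X).labels_

# Exclude the noise label (-1) from the cluster count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_share = np.mean(labels == -1)
print(f'{n_clusters} clusters, {noise_share:.1%} of points labeled as noise')
```

Treating the -1 group as "just another cluster" would distort both the cluster count and any per-cluster summary statistics.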

5.A.g Analysis and Saving

Top

Let's first combine all the clustering labels as features, so we can look at them together, and then save the resulting datasets. Some clustering algorithms can predict labels for the test set without retraining; others (such as hierarchical clustering) cannot. Since the test data must not be used to define the clusters, in the cases where retraining would be needed a K-Nearest Neighbors classifier will be used to decide which cluster each test sample belongs to. In effect, after labeling the training samples we fix the "cluster boundaries" based on the density of each label in feature space, and then assign every incoming test sample to the cluster whose region it falls in.
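The label-propagation idea can be sketched as follows. This is a hedged, self-contained example on synthetic blobs, not the project's actual pipeline: hierarchical clustering (which has no `predict` method) labels the training set, and a K-Nearest Neighbors classifier then assigns test samples to the frozen cluster regions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled train/test feature matrices.
X, _ = make_blobs(n_samples=600, centers=3, random_state=1)
X_tr, X_te = X[:500], X[500:]

# Step 1: cluster the training data only (no test leakage).
train_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_tr)

# Step 2: freeze the cluster boundaries with a KNN classifier trained on
# the labeled training points, then assign each test point to the nearest
# labeled region.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, train_labels)
test_labels = knn.predict(X_te)
print(test_labels[:10])
```

The same two-step pattern works for any clustering algorithm that lacks an out-of-sample `predict`.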

To create the features and save the data, the function "create_cluster_feats_4d" will be used. Its docstring can be read below.

In [342]:
print(clust.create_cluster_feats_4d.__doc__)
    Adds the features created by clustering for the selected 4D cases (age, income, gender, member_since_epoch).
    The features to add are: kmeans_8, ward_12 and dbscan_10.
    Args:
        static_dataset_path(str): The path to the static dataset to be taken as the initial data.
        output_path(str): The path to save the new dataset.
        save(boolean): Whether to save the new static dataset.
    Returns:
        static_cluster1_dataset(dataframe): The same as the static dataset but with the features added into new
            columns.
        X_train_r(dataframe): X_train (as obtained from time-split with the input static data) with the new features.
        X_test_r(dataframe): X_test (as obtained from time-split with the input static data) with the new features.
        y_train(pd.Series): y_train as obtained from time-split with the input static data.
        y_test(pd.Series): y_test as obtained from time-split with the input static data.
    
In [343]:
%time static_cluster1_dataset, X_train_r, X_test_r, y_train, y_test =\
    clust.create_cluster_feats_4d(save=True)
/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/preprocessing/data.py:625: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

/home/miguel/github_repos/starbucks-advertising/src/features/clustering.py:139: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

/home/miguel/github_repos/starbucks-advertising/src/features/clustering.py:142: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

CPU times: user 12 s, sys: 669 ms, total: 12.7 s
Wall time: 12.7 s
In [344]:
X_train_r.head()
Out[344]:
person time offer_id age became_member_on gender income missing_demographics member_epoch_days difficulty ... offer_type reward_t channel_mobile channel_email channel_web channel_social expected_finish kmeans_8 ward_12 dbscan_10
0 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... informational 0.0 1.0 1.0 0.0 1.0 240.0 0.0 8.0 1.0
1 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... informational 0.0 1.0 1.0 1.0 0.0 432.0 0.0 8.0 1.0
2 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 33.0 2017-04-21 M 72000.0 0 17277 5.0 ... bogo 5.0 1.0 1.0 1.0 1.0 528.0 0.0 8.0 1.0
3 00116118485d4dfda04fdbaba9a87b5c 168 f19421c1d4aa40978ebb69ca19b0e20d NaN 2018-04-25 None NaN 1 17646 5.0 ... bogo 5.0 1.0 1.0 1.0 1.0 288.0 NaN NaN NaN
4 0011e0d4e6b944f998e987f904e8c1e5 0 3f207df678b143eea3cee63160fa8bed 40.0 2018-01-09 O 57000.0 0 17540 0.0 ... informational 0.0 1.0 1.0 1.0 0.0 96.0 0.0 9.0 -1.0

5 rows × 21 columns

In [345]:
sns.pairplot(X_train_r[['kmeans_8', 'ward_12', 'dbscan_10']])
/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/numpy/lib/histograms.py:754: RuntimeWarning:

invalid value encountered in greater_equal

/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/numpy/lib/histograms.py:755: RuntimeWarning:

invalid value encountered in less_equal

Out[345]:
<seaborn.axisgrid.PairGrid at 0x7fc00405e278>

5.B 3D Clustering (age, income, member_since)

Top

As seen in the last part, many clustering algorithms tend to split mostly by gender, which is reasonable because "gender" is a discrete feature with only 3 possible values. I therefore decided to cluster the customers without the gender feature. An additional advantage of this kind of clustering is that the samples can be visualized directly in a 3D plot, without needing PCA. The original 3D plots were created with Plotly but, as notebooks with Plotly 3D plots take a while to load, only images of the original plots are included here.
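For readers without Plotly, a comparable static 3D scatter can be produced with matplotlib. This is a sketch on random stand-in data (the column names mirror the project's features but nothing here comes from the real dataset):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs as a script
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
# Stand-ins for the scaled age / income / member_epoch_days columns
# and for 4 cluster labels.
X = rng.normal(size=(300, 3))
labels = rng.integers(0, 4, size=300)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels, cmap='viridis', s=10)
ax.set_xlabel('age')
ax.set_ylabel('income')
ax.set_zlabel('member_epoch_days')
```

Coloring by cluster label makes a gender-free 3-feature clustering directly inspectable without any dimensionality reduction.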

In [346]:
%reset -f
In [347]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from time import time
from scipy.cluster.hierarchy import dendrogram, ward, fcluster
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_cluster1.pkl')

import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.visualization.visualize as vis
import src.features.clustering as clust
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [348]:
# Get the data
X_train, X_test, y_train, y_test, encoder = sd.get_success_data(
    basic_dataset_path=STATIC_DATASET_PATH,
    drop_time=False,
    anon=False)

# Encode and filter relevant features
customer_feats = ['age', 'income', 'missing_demographics', 
                  'member_epoch_days']
offer_feats = ['difficulty', 'duration', 'offer_type', 'reward_t', 
               'channel_web', 'channel_social', 'channel_mobile', 
               'channel_email']

X_train_t = encoder.fit_transform(X_train)
X_train_t = X_train_t[customer_feats]
X_test_t = encoder.transform(X_test)
X_test_t = X_test_t[customer_feats]

# Drop duplicates and missing data
X_train_t = X_train_t.dropna().drop_duplicates()
X_test_t = X_test_t.dropna().drop_duplicates()

# Keep a copy with the original demographics
X_train_o = X_train_t.copy()
X_test_o = X_test_t.copy()

# Drop the irrelevant column
X_train_t = X_train_t.drop('missing_demographics', axis=1)
X_test_t = X_test_t.drop('missing_demographics', axis=1)

# Normalize
scaler = StandardScaler()
scaler.fit(X_train_t)

X_train_t = pd.DataFrame(scaler.transform(X_train_t),
                         index=X_train_t.index,
                         columns=X_train_t.columns)
X_test_t = pd.DataFrame(scaler.transform(X_test_t),
                         index=X_test_t.index,
                         columns=X_test_t.columns)

# Show X_train_t
X_train_t.head()
/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/preprocessing/data.py:625: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/ipykernel_launcher.py:35: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/ipykernel_launcher.py:38: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

Out[348]:
age income member_epoch_days
0 -1.230818 0.305738 0.147407
7 -0.828193 -0.388567 0.774616
12 0.264647 1.138904 -0.837526
17 -1.748479 -0.249706 -0.236550
21 -1.633444 0.352025 0.292881
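The `DataConversionWarning` messages above come from `StandardScaler` silently converting the mixed `int64`/`float64` columns to `float64`. A minimal sketch of one way to silence them (assuming a toy frame with illustrative column names, not the actual project data) is to make the cast explicit before scaling:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with mixed integer and float dtypes, standing in for the
# customer features (the column names here are just illustrative).
df = pd.DataFrame({'age': [25, 40, 60], 'income': [30e3, 50e3, 90e3]})

# Casting to float64 up front makes the dtype conversion explicit,
# so StandardScaler no longer emits a DataConversionWarning.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df.astype('float64')),
                      index=df.index, columns=df.columns)
```

Wrapping the result back into a `DataFrame` (as done above with `X_train_t`) keeps the index and column names, which the later `groupby` visualizations rely on.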

5.B.a Clustering EDA

Top

Let's visualize the data in 3D

Visualize 3D clusters

A function was created for this purpose (it uses Plotly, which is why only the resulting images are copied here). Below is the result of a K-means run with 4 clusters.
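For readers without the Plotly environment, a rough matplotlib equivalent of such a 3D cluster view can be sketched as follows (toy data and made-up labels; the real figure uses the project's `visualize_3d_clusters` helper with Plotly):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection)

# Toy standardized data with three columns, standing in for
# age / income / member_epoch_days; the cluster labels are made up here.
rng = np.random.RandomState(0)
X = rng.randn(300, 3)
labels = rng.randint(0, 4, size=300)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for k in np.unique(labels):
    pts = X[labels == k]
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], label=f'cluster {k}', s=10)
ax.set_xlabel('age')
ax.set_ylabel('income')
ax.set_zlabel('member_epoch_days')
ax.legend()
```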

5.B.b K-Means

Top

In [349]:
clusters = [{'n_clusters': n} for n in range(2, 30)]
silhouette, error, best_n = clust.validate_clustering(X_train_t, KMeans, clusters, 
                                                      clust.kmeans_error, 'n_clusters')
Algorithm 1 of 28 finished in 4.069716930389404 seconds.
Algorithm 2 of 28 finished in 3.910810708999634 seconds.
Algorithm 3 of 28 finished in 3.956387758255005 seconds.
Algorithm 4 of 28 finished in 4.000387191772461 seconds.
Algorithm 5 of 28 finished in 4.503861904144287 seconds.
Algorithm 6 of 28 finished in 4.883861303329468 seconds.
Algorithm 7 of 28 finished in 4.997447490692139 seconds.
Algorithm 8 of 28 finished in 5.155400037765503 seconds.
Algorithm 9 of 28 finished in 4.872424364089966 seconds.
Algorithm 10 of 28 finished in 4.662208080291748 seconds.
Algorithm 11 of 28 finished in 5.074881315231323 seconds.
Algorithm 12 of 28 finished in 5.3537757396698 seconds.
Algorithm 13 of 28 finished in 7.304142475128174 seconds.
Algorithm 14 of 28 finished in 7.132237672805786 seconds.
Algorithm 15 of 28 finished in 6.288344383239746 seconds.
Algorithm 16 of 28 finished in 6.564389944076538 seconds.
Algorithm 17 of 28 finished in 8.207704067230225 seconds.
Algorithm 18 of 28 finished in 6.557082176208496 seconds.
Algorithm 19 of 28 finished in 6.612433195114136 seconds.
Algorithm 20 of 28 finished in 7.027898788452148 seconds.
Algorithm 21 of 28 finished in 7.008470058441162 seconds.
Algorithm 22 of 28 finished in 7.033693790435791 seconds.
Algorithm 23 of 28 finished in 6.828122138977051 seconds.
Algorithm 24 of 28 finished in 7.164654493331909 seconds.
Algorithm 25 of 28 finished in 6.999694585800171 seconds.
Algorithm 26 of 28 finished in 7.610993146896362 seconds.
Algorithm 27 of 28 finished in 7.70470118522644 seconds.
Algorithm 28 of 28 finished in 8.152178287506104 seconds.
The best Silhouette score is for {'n_clusters': 3}, and its value is: 0.3037735718267767
The error for {'n_clusters': 3} is: 22872.601704490942
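`clust.validate_clustering` is a project-specific helper; a rough stand-alone equivalent of this sweep, shown here on synthetic blobs rather than the actual customer features, is just a loop over `n_clusters` keeping the Silhouette score of each fit:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [10, 10], [-10, 10]],
                  random_state=2018)

# Sweep n_clusters and keep each fit's Silhouette score, mirroring
# what clust.validate_clustering does for this notebook.
scores = {}
for n in range(2, 8):
    km = KMeans(n_clusters=n, random_state=2018, n_init=10)
    labels = km.fit_predict(X)
    scores[n] = silhouette_score(X, labels)

best_n = max(scores, key=scores.get)
```

On these three well-separated blobs the score peaks at `n = 3`, the same kind of outcome the helper reports above for the real data.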

OK, let's use KMeans with n = 3

In [350]:
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, random_state=2018)
kmeans.fit(X_train_t)
cluster_labels = kmeans.predict(X_train_t)

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.8583605312374512
Explained variance ratio for the first two components: 0.7807307089085307
Explained variance ratio for the first two components: 0.7402386617200014
Out[350]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [351]:
# visualize_3d_clusters(X_train_t, cluster_labels)

Description of the clusters:

  • 0) Young, low-income customers who joined the app recently.
  • 1) Older, high-income customers.
  • 2) Customers who joined the app a long time ago (early adopters).
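The heatmap and bar charts above show cluster means in standardized units. To read the centroids in original units (years, dollars, days), `scaler.inverse_transform` can be applied to `kmeans.cluster_centers_` — a sketch on toy data with the same column names (the value ranges are made up):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data in original units (columns mirror the notebook's features;
# the ranges are illustrative, not taken from the real data).
rng = np.random.RandomState(2018)
raw = pd.DataFrame({
    'age': rng.randint(18, 90, 200).astype(float),
    'income': rng.randint(30_000, 120_000, 200).astype(float),
    'member_epoch_days': rng.randint(16_000, 18_000, 200).astype(float),
})

scaler = StandardScaler()
X = scaler.fit_transform(raw)
km = KMeans(n_clusters=3, random_state=2018, n_init=10).fit(X)

# Map the standardized centroids back to interpretable units.
centers = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                       columns=raw.columns)
```

This makes descriptions like "young, low-income" directly checkable against the centroid values.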

5.B.c Hierarchical (Ward)

Top

In [352]:
%time linkage_matrix = ward(X_train_t)

p = 50
%time ddg = dendrogram(linkage_matrix, truncate_mode='lastp', p=p)
plt.hlines(40, 0, 10 * p, colors='y')
CPU times: user 9.66 s, sys: 533 ms, total: 10.2 s
Wall time: 10.2 s
CPU times: user 82.6 ms, sys: 0 ns, total: 82.6 ms
Wall time: 82.4 ms
Out[352]:
<matplotlib.collections.LineCollection at 0x7fc01c0414a8>

With that cut distance, maybe 9 clusters would be reasonable. Let's check that with the Silhouette score. But first, another visualization of the features and clusters.

In [353]:
%time sns.clustermap(X_train_t, figsize=(10, 20), method='ward', cmap='viridis')
CPU times: user 12.2 s, sys: 688 ms, total: 12.9 s
Wall time: 12.9 s
Out[353]:
<seaborn.matrix.ClusterGrid at 0x7fc01c4fccf8>
In [354]:
# cut_dists = np.arange(10, 150, 10)
cut_dists = np.exp(np.linspace(0, 5.5, 50))
results = defaultdict(dict)

# Cluster and gather results data
for i, max_d in enumerate(cut_dists):
    tic = time()
    clusters = fcluster(linkage_matrix, max_d, criterion='distance')
    n_clusters = len(np.unique(clusters))
    
    results['n_clusters'][max_d] = n_clusters
    if n_clusters >= 2:
        results['score'][max_d] = silhouette_score(X_train_t, clusters)
    else:
        results['score'][max_d] = 0
    
    toc = time()
    print('Algorithm {} of {} finished in {} seconds.'.format(
        i + 1, len(cut_dists), (toc - tic)))

    
# Show the results
best_d, best_score = max(results['score'].items(), key=lambda x: x[1])
best_n_clust = results['n_clusters'][best_d]
print(
    'The best Silhouette score is for {} clusters (maximum distance '.format(
    best_n_clust) + '= {}), and its value is: {}'.format(best_d, best_score))

# Plot silhouette score vs distance
m_dists = results['score'].keys()
scores = results['score'].values()
plt.plot(m_dists, scores)
plt.title('Silhouette score')
plt.vlines(best_d, min(scores), max(scores), 'r')
plt.xlabel('Maximum distance')

# Plot silhouette score vs n_clusters
plt.figure()
n, score_n = zip(*((results['n_clusters'][d], results['score'][d]) 
                   for d in results['n_clusters'].keys()))
plt.plot(n, score_n)
plt.title('Silhouette score')
plt.vlines(best_n_clust, min(score_n), max(score_n), 'r')
plt.xlabel('N clusters')

# Plot n_clusters vs distance
plt.figure()
m_dists = results['n_clusters'].keys()
n = results['n_clusters'].values()
plt.plot(m_dists, n)
plt.title('N clusters')
plt.xlabel('Max distance')
plt.ylabel('N')
plt.vlines(best_d, 0, max(n), 'r')
Algorithm 1 of 50 finished in 4.1940600872039795 seconds.
Algorithm 2 of 50 finished in 4.386510133743286 seconds.
Algorithm 3 of 50 finished in 4.026604890823364 seconds.
Algorithm 4 of 50 finished in 3.427976608276367 seconds.
Algorithm 5 of 50 finished in 3.4339451789855957 seconds.
Algorithm 6 of 50 finished in 3.42634916305542 seconds.
Algorithm 7 of 50 finished in 3.4180967807769775 seconds.
Algorithm 8 of 50 finished in 3.4466025829315186 seconds.
Algorithm 9 of 50 finished in 3.4405853748321533 seconds.
Algorithm 10 of 50 finished in 3.424112319946289 seconds.
Algorithm 11 of 50 finished in 3.4406282901763916 seconds.
Algorithm 12 of 50 finished in 3.957019329071045 seconds.
Algorithm 13 of 50 finished in 4.509637355804443 seconds.
Algorithm 14 of 50 finished in 4.185714483261108 seconds.
Algorithm 15 of 50 finished in 4.693584680557251 seconds.
Algorithm 16 of 50 finished in 5.06961464881897 seconds.
Algorithm 17 of 50 finished in 4.108058452606201 seconds.
Algorithm 18 of 50 finished in 4.369096517562866 seconds.
Algorithm 19 of 50 finished in 4.040919303894043 seconds.
Algorithm 20 of 50 finished in 3.9378840923309326 seconds.
Algorithm 21 of 50 finished in 3.798323631286621 seconds.
Algorithm 22 of 50 finished in 3.47440242767334 seconds.
Algorithm 23 of 50 finished in 3.383159637451172 seconds.
Algorithm 24 of 50 finished in 3.4808082580566406 seconds.
Algorithm 25 of 50 finished in 3.3951547145843506 seconds.
Algorithm 26 of 50 finished in 3.3932271003723145 seconds.
Algorithm 27 of 50 finished in 3.344393014907837 seconds.
Algorithm 28 of 50 finished in 3.3529281616210938 seconds.
Algorithm 29 of 50 finished in 4.514716386795044 seconds.
Algorithm 30 of 50 finished in 4.349892854690552 seconds.
Algorithm 31 of 50 finished in 3.9059817790985107 seconds.
Algorithm 32 of 50 finished in 4.250101327896118 seconds.
Algorithm 33 of 50 finished in 4.251211881637573 seconds.
Algorithm 34 of 50 finished in 4.685390949249268 seconds.
Algorithm 35 of 50 finished in 4.625629663467407 seconds.
Algorithm 36 of 50 finished in 4.8444578647613525 seconds.
Algorithm 37 of 50 finished in 4.996072053909302 seconds.
Algorithm 38 of 50 finished in 4.642498731613159 seconds.
Algorithm 39 of 50 finished in 4.338840484619141 seconds.
Algorithm 40 of 50 finished in 3.7568984031677246 seconds.
Algorithm 41 of 50 finished in 4.014132261276245 seconds.
Algorithm 42 of 50 finished in 3.98952317237854 seconds.
Algorithm 43 of 50 finished in 3.8057448863983154 seconds.
Algorithm 44 of 50 finished in 3.8185219764709473 seconds.
Algorithm 45 of 50 finished in 3.79876446723938 seconds.
Algorithm 46 of 50 finished in 0.047498464584350586 seconds.
Algorithm 47 of 50 finished in 0.04669189453125 seconds.
Algorithm 48 of 50 finished in 0.04369688034057617 seconds.
Algorithm 49 of 50 finished in 0.04098176956176758 seconds.
Algorithm 50 of 50 finished in 0.04089832305908203 seconds.
The best Silhouette score is for 3 clusters (maximum distance = 99.68755217120427), and its value is: 0.2724448929303689
Out[354]:
<matplotlib.collections.LineCollection at 0x7fc0161e8c50>

Zooming in on the area of interest...

In [355]:
# Plot silhouette score vs n_clusters
n_max = 20
n, score_n = zip(*((results['n_clusters'][d], results['score'][d]) 
                   for d in results['n_clusters'].keys()))
n = np.array(n)
score_n = np.array(score_n)
plt.plot(n[n<n_max], score_n[n<n_max])
plt.title('Silhouette score')
plt.vlines(best_n_clust, min(score_n[n<n_max]), max(score_n[n<n_max]), 'r')
plt.xlabel('N clusters')
Out[355]:
Text(0.5, 0, 'N clusters')

Let's visualize the results

In [393]:
max_d = best_d
print(max_d)
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
99.68755217120427
Explained variance ratio for the first two components: 0.7841100769293812
Explained variance ratio for the first two components: 0.8283633163861489
Explained variance ratio for the first two components: 0.7526873693206712
Out[393]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [357]:
# visualize_3d_clusters(X_s_train, cluster_labels)

It's similar to, but not the same as, the k-means clustering. Let's try with more clusters.

In [358]:
res_df = pd.DataFrame(results)
res_df.index.name = 'distance'
res_df = res_df.reset_index().sort_values(by='n_clusters')
res_df.head(30)
Out[358]:
distance n_clusters score
49 244.691932 1 0.000000
45 156.181752 1 0.000000
48 218.711848 1 0.000000
47 195.490191 1 0.000000
46 174.734085 1 0.000000
44 139.599207 2 0.250901
43 124.777308 2 0.250901
42 111.529119 3 0.272445
41 99.687552 3 0.272445
40 89.103260 4 0.238488
39 79.642751 4 0.238488
38 71.186708 4 0.238488
37 63.628484 4 0.238488
35 50.834305 7 0.219468
36 56.872751 7 0.219468
34 45.436990 9 0.200243
33 40.612733 9 0.200243
32 36.300689 11 0.186094
31 32.446476 13 0.188251
30 29.001482 14 0.187229
29 25.922260 15 0.189981
28 23.169972 19 0.195897
27 20.709908 22 0.186276
26 18.511041 25 0.177970
25 16.545637 29 0.179333
24 14.788910 31 0.172764
23 13.218702 36 0.156338
22 11.815211 46 0.162190
21 10.560735 51 0.166698
20 9.439452 60 0.170015
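Looking up the distance that happens to yield a given cluster count, as done below, works but is indirect. `fcluster` can also target the count itself via `criterion='maxclust'` — a sketch on toy data (the three synthetic blobs are an assumption, not the project data):

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

# Toy data standing in for the scaled features: three offset blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 3) + c
               for c in ([0, 0, 0], [8, 8, 8], [-8, 8, 0])])

Z = ward(X)

# Ask for (at most) 9 flat clusters directly, instead of hunting for
# the cut height that produces that count.
labels = fcluster(Z, t=9, criterion='maxclust')
```

With generic data the merge heights are distinct, so this yields exactly `t` clusters.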

n_clusters = 19

In [394]:
dist_19 = res_df[res_df.n_clusters == 19].distance.values[0]

max_d = dist_19
print(max_d)
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
23.16997229826248
Explained variance ratio for the first two components: 0.866549042122249
Explained variance ratio for the first two components: 0.7604900679600395
Explained variance ratio for the first two components: 0.8269616370586979
Explained variance ratio for the first two components: 0.8735532206085663
Explained variance ratio for the first two components: 0.8709935444660412
Explained variance ratio for the first two components: 0.8495645051638883
Explained variance ratio for the first two components: 0.8394682322759035
Explained variance ratio for the first two components: 0.8485187161954266
Explained variance ratio for the first two components: 0.8395101747564646
Explained variance ratio for the first two components: 0.7955958497403162
Explained variance ratio for the first two components: 0.795513973510334
Explained variance ratio for the first two components: 0.7881763572946476
Explained variance ratio for the first two components: 0.8522167029862139
Explained variance ratio for the first two components: 0.7914118010666675
Explained variance ratio for the first two components: 0.8099828560074727
Explained variance ratio for the first two components: 0.7953277647958493
Explained variance ratio for the first two components: 0.831568448100672
Explained variance ratio for the first two components: 0.8764767742914314
Explained variance ratio for the first two components: 0.775372370798532
Out[394]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [360]:
# visualize_3d_clusters(X_s_train, cluster_labels)

What about 9?

n_clusters = 9

In [395]:
dist_9 = res_df[res_df.n_clusters == 9].distance.values[0]

max_d = dist_9
print(max_d)
cluster_labels = fcluster(linkage_matrix, max_d, criterion='distance')
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
45.436989981397055
Explained variance ratio for the first two components: 0.8665490421222485
Explained variance ratio for the first two components: 0.8701653491467216
Explained variance ratio for the first two components: 0.7318380916847664
Explained variance ratio for the first two components: 0.8952750605726705
Explained variance ratio for the first two components: 0.8182347964247951
Explained variance ratio for the first two components: 0.8793619840177034
Explained variance ratio for the first two components: 0.8468146181018206
Explained variance ratio for the first two components: 0.8764767742914314
Explained variance ratio for the first two components: 0.7753723707985316
Out[395]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [362]:
# visualize_3d_clusters(X_s_train, cluster_labels)

There seem to be many non-trivial clusterings. I will keep all three cases (3, 9, and 19 clusters).

5.B.d GMM

Top

In [363]:
clusters = [{'n_components': n} for n in range(2, 30)]
silhouette, aic, best_n = clust.validate_clustering(X_train_t, GaussianMixture, clusters, 
                                                    clust.gmm_aic, 'n_components')
print('The best case, based on the Silhouette score, is for {}'.format(best_n))
Algorithm 1 of 28 finished in 3.88468074798584 seconds.
Algorithm 2 of 28 finished in 3.8152592182159424 seconds.
Algorithm 3 of 28 finished in 3.5756871700286865 seconds.
Algorithm 4 of 28 finished in 3.8774614334106445 seconds.
Algorithm 5 of 28 finished in 3.8292531967163086 seconds.
Algorithm 6 of 28 finished in 3.7678353786468506 seconds.
Algorithm 7 of 28 finished in 3.9361519813537598 seconds.
Algorithm 8 of 28 finished in 4.231604814529419 seconds.
Algorithm 9 of 28 finished in 4.002884149551392 seconds.
Algorithm 10 of 28 finished in 3.745725154876709 seconds.
Algorithm 11 of 28 finished in 3.855477809906006 seconds.
Algorithm 12 of 28 finished in 3.932504415512085 seconds.
Algorithm 13 of 28 finished in 4.29953932762146 seconds.
Algorithm 14 of 28 finished in 3.9742448329925537 seconds.
Algorithm 15 of 28 finished in 3.8399863243103027 seconds.
Algorithm 16 of 28 finished in 4.040496826171875 seconds.
Algorithm 17 of 28 finished in 4.454254150390625 seconds.
Algorithm 18 of 28 finished in 4.360625982284546 seconds.
Algorithm 19 of 28 finished in 4.2007246017456055 seconds.
Algorithm 20 of 28 finished in 3.9610142707824707 seconds.
Algorithm 21 of 28 finished in 4.119219779968262 seconds.
Algorithm 22 of 28 finished in 4.761711120605469 seconds.
Algorithm 23 of 28 finished in 4.93835711479187 seconds.
Algorithm 24 of 28 finished in 4.48497200012207 seconds.
Algorithm 25 of 28 finished in 4.524496078491211 seconds.
Algorithm 26 of 28 finished in 4.847931861877441 seconds.
Algorithm 27 of 28 finished in 4.3072474002838135 seconds.
Algorithm 28 of 28 finished in 4.331759214401245 seconds.
The best Silhouette score is for {'n_components': 3}, and its value is: 0.26233253657713057
The error for {'n_components': 3} is: 116186.90006289177
The best case, based on the Silhouette score, is for {'n_components': 3}
In [364]:
pd.DataFrame([clusters, aic, silhouette], index=['params', 'aic', 'silhouette']).T
Out[364]:
params aic silhouette
0 {'n_components': 2} 120784 0.244615
1 {'n_components': 3} 116187 0.262333
2 {'n_components': 4} 115598 0.232364
3 {'n_components': 5} 115166 0.236865
4 {'n_components': 6} 115097 0.22837
5 {'n_components': 7} 114712 0.22841
6 {'n_components': 8} 114429 0.225467
7 {'n_components': 9} 114003 0.247602
8 {'n_components': 10} 113903 0.238859
9 {'n_components': 11} 113893 0.239107
10 {'n_components': 12} 113718 0.237365
11 {'n_components': 13} 113537 0.246058
12 {'n_components': 14} 113693 0.246559
13 {'n_components': 15} 113571 0.247126
14 {'n_components': 16} 113525 0.244259
15 {'n_components': 17} 113446 0.242433
16 {'n_components': 18} 113136 0.229831
17 {'n_components': 19} 113117 0.233874
18 {'n_components': 20} 113052 0.231321
19 {'n_components': 21} 113301 0.242149
20 {'n_components': 22} 113065 0.233761
21 {'n_components': 23} 112984 0.230527
22 {'n_components': 24} 112959 0.229924
23 {'n_components': 25} 112926 0.231036
24 {'n_components': 26} 112857 0.225847
25 {'n_components': 27} 112611 0.22022
26 {'n_components': 28} 112917 0.223034
27 {'n_components': 29} 112810 0.228188
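The AIC column keeps creeping down with more components, which is typical. BIC, which adds a log(n) penalty per parameter, tends to pick a more parsimonious model and is a common alternative — a sketch on synthetic blobs (an assumption; the notebook's helper uses AIC on the real features):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled customer features: three 3D blobs.
X, _ = make_blobs(n_samples=600,
                  centers=[[0, 0, 0], [6, 6, 6], [-6, 6, 0]],
                  random_state=2018)

# BIC penalizes extra components harder than AIC, so its minimum is a
# more conservative choice of n_components.
bics = {}
for n in range(1, 8):
    gmm = GaussianMixture(n_components=n, random_state=2018).fit(X)
    bics[n] = gmm.bic(X)

best_n = min(bics, key=bics.get)
```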

Let's check the 3-cluster case and the 16-cluster case.

3 clusters

In [365]:
gmm = GaussianMixture(n_components=3)
gmm.fit(X_train_t)
cluster_labels = gmm.predict(X_train_t)

vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.7903645522376377
Explained variance ratio for the first two components: 0.8079214686215597
Explained variance ratio for the first two components: 0.9590132663513165
Out[365]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [366]:
# visualize_3d_clusters(X_s_train, cluster_labels)

16 clusters

In [367]:
gmm = GaussianMixture(n_components=16)
gmm.fit(X_train_t)
cluster_labels = gmm.predict(X_train_t)

vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.822032754558504
Explained variance ratio for the first two components: 0.8225284574918053
Explained variance ratio for the first two components: 0.7636428525536204
Explained variance ratio for the first two components: 0.7788041988165122
Explained variance ratio for the first two components: 0.7918349020329725
Explained variance ratio for the first two components: 0.7461996469134407
Explained variance ratio for the first two components: 0.799798197241993
Explained variance ratio for the first two components: 0.7176867151788222
Explained variance ratio for the first two components: 0.739475147392625
Explained variance ratio for the first two components: 0.7845318362276235
Explained variance ratio for the first two components: 0.7918386190905231
Explained variance ratio for the first two components: 0.7600903097027623
Explained variance ratio for the first two components: 0.7444472800946353
Explained variance ratio for the first two components: 0.7650785529971499
Explained variance ratio for the first two components: 0.8141522296903682
Explained variance ratio for the first two components: 0.8950661826906228
Out[367]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [368]:
# visualize_3d_clusters(X_s_train, cluster_labels)

5.B.e DBSCAN

Top

Let's grid search for DBSCAN parameters.
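Before a full grid search, the usual k-distance heuristic can narrow the `eps` range: for `min_samples = k`, sort each point's distance to its k-th nearest point and look for the elbow of that curve. A minimal sketch on synthetic data (the 95th-percentile elbow proxy is a rough assumption, not part of the notebook's method):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the scaled features.
X, _ = make_blobs(n_samples=500, centers=3, random_state=2018)

# For min_samples = k, collect each point's distance to its k-th
# nearest point (the query point itself counts as the first neighbour).
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nn.kneighbors(X)
kth_dists = np.sort(dists[:, -1])

# A crude elbow proxy: the k-distance at the 95th percentile. Plotting
# kth_dists and eyeballing the bend is the more common approach.
eps_guess = kth_dists[int(0.95 * len(kth_dists))]
```

Values of `eps` near the elbow are the ones worth including in the grid below.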

In [369]:
# Define DBSCAN parameters
n_batch = 20
m_batch = 5
eps_list = np.linspace(0.1, 2.0, n_batch).repeat(m_batch)
min_samples_list = [5, 20, 50, 100, 300] * n_batch
params = [{'eps': e, 'min_samples': ms} for e, ms in zip(eps_list, min_samples_list)]

# Search and gather the results
silhouette = list()
n_clusters = list()
for i, param_set in enumerate(params):
    tic = time()
    method = DBSCAN(**param_set)
    method.fit(X_train_t)
    labels = method.labels_
    try:
        silhouette.append(silhouette_score(X_train_t, labels))
    except ValueError:
        silhouette.append(0)
    n_clusters.append(len(np.unique(labels)))
    toc = time()
    print('Algorithm {} of {} finished in {} seconds. Params: {} Clusters: {}'.format(
        i + 1, len(params), (toc - tic), param_set, len(np.unique(labels))))
    
# Show the results
best_silhouette_params = params[np.argmax(silhouette)]
print('The best Silhouette score is for {}, and its value is: {}'.format(
    best_silhouette_params, max(silhouette)))
print('The number of clusters for {} is: {}'.format(
    best_silhouette_params, n_clusters[np.argmax(silhouette)]))
Algorithm 1 of 100 finished in 3.7083120346069336 seconds. Params: {'eps': 0.1, 'min_samples': 5} Clusters: 352
Algorithm 2 of 100 finished in 0.11442780494689941 seconds. Params: {'eps': 0.1, 'min_samples': 20} Clusters: 1
Algorithm 3 of 100 finished in 0.11713814735412598 seconds. Params: {'eps': 0.1, 'min_samples': 50} Clusters: 1
Algorithm 4 of 100 finished in 0.11674213409423828 seconds. Params: {'eps': 0.1, 'min_samples': 100} Clusters: 1
Algorithm 5 of 100 finished in 0.11516547203063965 seconds. Params: {'eps': 0.1, 'min_samples': 300} Clusters: 1
Algorithm 6 of 100 finished in 3.963618516921997 seconds. Params: {'eps': 0.2, 'min_samples': 5} Clusters: 71
Algorithm 7 of 100 finished in 3.7124240398406982 seconds. Params: {'eps': 0.2, 'min_samples': 20} Clusters: 11
Algorithm 8 of 100 finished in 3.940795421600342 seconds. Params: {'eps': 0.2, 'min_samples': 50} Clusters: 6
Algorithm 9 of 100 finished in 0.161285400390625 seconds. Params: {'eps': 0.2, 'min_samples': 100} Clusters: 1
Algorithm 10 of 100 finished in 0.1664581298828125 seconds. Params: {'eps': 0.2, 'min_samples': 300} Clusters: 1
Algorithm 11 of 100 finished in 4.240633726119995 seconds. Params: {'eps': 0.3, 'min_samples': 5} Clusters: 7
Algorithm 12 of 100 finished in 4.172931909561157 seconds. Params: {'eps': 0.3, 'min_samples': 20} Clusters: 6
Algorithm 13 of 100 finished in 3.857574701309204 seconds. Params: {'eps': 0.3, 'min_samples': 50} Clusters: 3
Algorithm 14 of 100 finished in 3.8562662601470947 seconds. Params: {'eps': 0.3, 'min_samples': 100} Clusters: 3
Algorithm 15 of 100 finished in 0.2287454605102539 seconds. Params: {'eps': 0.3, 'min_samples': 300} Clusters: 1
Algorithm 16 of 100 finished in 4.20611047744751 seconds. Params: {'eps': 0.4, 'min_samples': 5} Clusters: 2
Algorithm 17 of 100 finished in 4.18147087097168 seconds. Params: {'eps': 0.4, 'min_samples': 20} Clusters: 3
Algorithm 18 of 100 finished in 4.312440633773804 seconds. Params: {'eps': 0.4, 'min_samples': 50} Clusters: 2
Algorithm 19 of 100 finished in 4.594228982925415 seconds. Params: {'eps': 0.4, 'min_samples': 100} Clusters: 2
Algorithm 20 of 100 finished in 4.915706157684326 seconds. Params: {'eps': 0.4, 'min_samples': 300} Clusters: 2
Algorithm 21 of 100 finished in 4.938894510269165 seconds. Params: {'eps': 0.5, 'min_samples': 5} Clusters: 2
Algorithm 22 of 100 finished in 4.544705629348755 seconds. Params: {'eps': 0.5, 'min_samples': 20} Clusters: 2
Algorithm 23 of 100 finished in 5.160315036773682 seconds. Params: {'eps': 0.5, 'min_samples': 50} Clusters: 3
Algorithm 24 of 100 finished in 4.743298292160034 seconds. Params: {'eps': 0.5, 'min_samples': 100} Clusters: 2
Algorithm 25 of 100 finished in 4.375377416610718 seconds. Params: {'eps': 0.5, 'min_samples': 300} Clusters: 2
Algorithm 26 of 100 finished in 0.5866818428039551 seconds. Params: {'eps': 0.6, 'min_samples': 5} Clusters: 1
Algorithm 27 of 100 finished in 4.803210496902466 seconds. Params: {'eps': 0.6, 'min_samples': 20} Clusters: 2
Algorithm 28 of 100 finished in 4.646209001541138 seconds. Params: {'eps': 0.6, 'min_samples': 50} Clusters: 2
Algorithm 29 of 100 finished in 5.079421520233154 seconds. Params: {'eps': 0.6, 'min_samples': 100} Clusters: 2
Algorithm 30 of 100 finished in 4.737025499343872 seconds. Params: {'eps': 0.6, 'min_samples': 300} Clusters: 2
Algorithm 31 of 100 finished in 0.7192583084106445 seconds. Params: {'eps': 0.7, 'min_samples': 5} Clusters: 1
Algorithm 32 of 100 finished in 0.8612222671508789 seconds. Params: {'eps': 0.7, 'min_samples': 20} Clusters: 1
Algorithm 33 of 100 finished in 5.09521484375 seconds. Params: {'eps': 0.7, 'min_samples': 50} Clusters: 2
Algorithm 34 of 100 finished in 4.74887752532959 seconds. Params: {'eps': 0.7, 'min_samples': 100} Clusters: 2
Algorithm 35 of 100 finished in 4.512999534606934 seconds. Params: {'eps': 0.7, 'min_samples': 300} Clusters: 2
Algorithm 36 of 100 finished in 0.8440749645233154 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 5} Clusters: 1
Algorithm 37 of 100 finished in 0.8542897701263428 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 20} Clusters: 1
Algorithm 38 of 100 finished in 5.348203182220459 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 50} Clusters: 2
Algorithm 39 of 100 finished in 5.325645208358765 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 100} Clusters: 2
Algorithm 40 of 100 finished in 5.209382057189941 seconds. Params: {'eps': 0.7999999999999999, 'min_samples': 300} Clusters: 2
Algorithm 41 of 100 finished in 1.0373337268829346 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 5} Clusters: 1
Algorithm 42 of 100 finished in 1.4079399108886719 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 20} Clusters: 1
Algorithm 43 of 100 finished in 1.2849984169006348 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 50} Clusters: 1
Algorithm 44 of 100 finished in 1.0112173557281494 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 100} Clusters: 1
Algorithm 45 of 100 finished in 4.825592041015625 seconds. Params: {'eps': 0.8999999999999999, 'min_samples': 300} Clusters: 2
Algorithm 46 of 100 finished in 1.299941062927246 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 5} Clusters: 1
Algorithm 47 of 100 finished in 1.1924681663513184 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 20} Clusters: 1
Algorithm 48 of 100 finished in 1.3133485317230225 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 50} Clusters: 1
Algorithm 49 of 100 finished in 1.353177547454834 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 100} Clusters: 1
Algorithm 50 of 100 finished in 5.540476083755493 seconds. Params: {'eps': 0.9999999999999999, 'min_samples': 300} Clusters: 2
Algorithm 51 of 100 finished in 1.5551323890686035 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 5} Clusters: 1
Algorithm 52 of 100 finished in 1.3863720893859863 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 20} Clusters: 1
Algorithm 53 of 100 finished in 1.4744606018066406 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 50} Clusters: 1
Algorithm 54 of 100 finished in 1.3405439853668213 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 100} Clusters: 1
Algorithm 55 of 100 finished in 5.461179256439209 seconds. Params: {'eps': 1.0999999999999999, 'min_samples': 300} Clusters: 2
[... Algorithms 56–99 omitted: eps from 1.2 to 2.0 with min_samples in {5, 20, 50, 100, 300}, each finishing in roughly 1.5–3.6 seconds and all yielding 1 cluster ...]
Algorithm 100 of 100 finished in 2.792435884475708 seconds. Params: {'eps': 2.0, 'min_samples': 300} Clusters: 1
The best Silhouette score is for {'eps': 1.0999999999999999, 'min_samples': 300}, and its value is: 0.40133519297931325
The number of clusters for {'eps': 1.0999999999999999, 'min_samples': 300} is: 2
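For reference, the sweep above can be reproduced with a loop like the following. This is a minimal sketch on synthetic data, not the notebook's actual code: `X` stands in for `X_train_t`, and rounding the eps grid avoids the floating-point noise (1.0999999999999999) visible in the log.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled training data
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

results = []
for eps in np.round(np.arange(0.1, 2.1, 0.1), 1):  # rounding avoids 1.0999999... noise
    for min_samples in (5, 20, 50, 100, 300):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # DBSCAN labels noise as -1; it doesn't count as a cluster
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        # silhouette_score needs at least two distinct labels; score 0 otherwise
        score = silhouette_score(X, labels) if len(set(labels)) > 1 else 0.0
        results.append((eps, min_samples, score, n_clusters))

best = max(results, key=lambda r: r[2])
```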
In [370]:
# 3D plot for the Silhouette score
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot_trisurf(eps_list, min_samples_list, silhouette, linewidth=0.2, antialiased=True)
Out[370]:
<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x7fc005af1ac8>
In [371]:
# 3D plot for the number of clusters
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot_trisurf(eps_list, min_samples_list, n_clusters, linewidth=0.2, antialiased=True)
Out[371]:
<mpl_toolkits.mplot3d.art3d.Poly3DCollection at 0x7fbff9bdb9b0>
In [372]:
results = pd.DataFrame([eps_list, min_samples_list, silhouette, n_clusters], 
                      index=['epsilon', 'min_samples', 'silhouette_score', 'n_clusters']).T
results.sort_values(by='silhouette_score', ascending=False)
Out[372]:
epsilon min_samples silhouette_score n_clusters
54 1.1 300.0 0.401335 2.0
20 0.5 5.0 0.378045 2.0
49 1.0 300.0 0.374592 2.0
26 0.6 20.0 0.367140 2.0
44 0.9 300.0 0.362180 2.0
32 0.7 50.0 0.356342 2.0
39 0.8 300.0 0.352348 2.0
38 0.8 100.0 0.349873 2.0
37 0.8 50.0 0.346309 2.0
21 0.5 20.0 0.344791 2.0
33 0.7 100.0 0.342820 2.0
28 0.6 100.0 0.342782 2.0
15 0.4 5.0 0.339281 2.0
34 0.7 300.0 0.327967 2.0
27 0.6 50.0 0.327157 2.0
23 0.5 100.0 0.316816 2.0
17 0.4 50.0 0.306223 2.0
29 0.6 300.0 0.273077 2.0
22 0.5 50.0 0.268793 3.0
16 0.4 20.0 0.242933 3.0
18 0.4 100.0 0.229789 2.0
24 0.5 300.0 0.210101 2.0
11 0.3 20.0 0.132512 6.0
10 0.3 5.0 0.048641 7.0
12 0.3 50.0 0.009374 3.0
75 1.6 5.0 0.000000 1.0
77 1.6 50.0 0.000000 1.0
78 1.6 100.0 0.000000 1.0
76 1.6 20.0 0.000000 1.0
50 1.1 5.0 0.000000 1.0
... ... ... ... ...
14 0.3 300.0 0.000000 1.0
25 0.6 5.0 0.000000 1.0
30 0.7 5.0 0.000000 1.0
31 0.7 20.0 0.000000 1.0
35 0.8 5.0 0.000000 1.0
36 0.8 20.0 0.000000 1.0
40 0.9 5.0 0.000000 1.0
61 1.3 20.0 0.000000 1.0
42 0.9 50.0 0.000000 1.0
41 0.9 20.0 0.000000 1.0
45 1.0 5.0 0.000000 1.0
58 1.2 100.0 0.000000 1.0
47 1.0 50.0 0.000000 1.0
48 1.0 100.0 0.000000 1.0
1 0.1 20.0 0.000000 1.0
51 1.1 20.0 0.000000 1.0
52 1.1 50.0 0.000000 1.0
53 1.1 100.0 0.000000 1.0
60 1.3 5.0 0.000000 1.0
55 1.2 5.0 0.000000 1.0
46 1.0 20.0 0.000000 1.0
56 1.2 20.0 0.000000 1.0
57 1.2 50.0 0.000000 1.0
59 1.2 300.0 0.000000 1.0
13 0.3 100.0 -0.026446 3.0
19 0.4 300.0 -0.071646 2.0
7 0.2 50.0 -0.346422 6.0
6 0.2 20.0 -0.358352 11.0
5 0.2 5.0 -0.371585 71.0
0 0.1 5.0 -0.529961 352.0

100 rows × 4 columns

Let's try with the best values then.

In [373]:
# Create clusters
eps = 1.1
min_samples = 300
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.9956885824192225
Explained variance ratio for the first two components: 0.7690947812515505
Out[373]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [374]:
# visualize_3d_clusters(X_s_train, cluster_labels)

In [375]:
results.sort_values(by=['n_clusters', 'silhouette_score'], ascending=False).head(15)
Out[375]:
epsilon min_samples silhouette_score n_clusters
0 0.1 5.0 -0.529961 352.0
5 0.2 5.0 -0.371585 71.0
6 0.2 20.0 -0.358352 11.0
10 0.3 5.0 0.048641 7.0
11 0.3 20.0 0.132512 6.0
7 0.2 50.0 -0.346422 6.0
22 0.5 50.0 0.268793 3.0
16 0.4 20.0 0.242933 3.0
12 0.3 50.0 0.009374 3.0
13 0.3 100.0 -0.026446 3.0
54 1.1 300.0 0.401335 2.0
20 0.5 5.0 0.378045 2.0
49 1.0 300.0 0.374592 2.0
26 0.6 20.0 0.367140 2.0
44 0.9 300.0 0.362180 2.0

Case: eps=0.2 min_samples=20

In [376]:
# Create clusters
eps = 0.2
min_samples = 20
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.7679912417684338
Explained variance ratio for the first two components: 0.8459062977547702
Explained variance ratio for the first two components: 0.8497584855573694
Explained variance ratio for the first two components: 0.8181324247320817
Explained variance ratio for the first two components: 0.7943894579077522
Explained variance ratio for the first two components: 0.8147385909849846
Explained variance ratio for the first two components: 0.7368009561354175
Explained variance ratio for the first two components: 0.7918115483801549
Explained variance ratio for the first two components: 0.9297101798515962
Explained variance ratio for the first two components: 0.7732491345980056
Explained variance ratio for the first two components: 0.8745033358044687
Out[376]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [377]:
# visualize_3d_clusters(X_s_train, cluster_labels)

DBSCAN seems to have found a dense "core" area and a less dense outer zone, plus some small "satellite" clusters.

Case: eps=0.3 min_samples=5

In [378]:
# Create clusters
eps = 0.3
min_samples = 5
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.8638013964800584
Explained variance ratio for the first two components: 0.7687633537319105
Explained variance ratio for the first two components: 0.942017763993274
Explained variance ratio for the first two components: 0.9852012837812405
Explained variance ratio for the first two components: 0.9873006684035786
Explained variance ratio for the first two components: 0.9029194167131287
Explained variance ratio for the first two components: 0.8890779155726101
Out[378]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [379]:
# visualize_3d_clusters(X_s_train, cluster_labels)

Case: eps=0.2 min_samples=50

In [380]:
# Create clusters
eps = 0.2
min_samples = 50
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.7678869140153082
Explained variance ratio for the first two components: 0.8393043746654593
Explained variance ratio for the first two components: 0.9022011686611264
Explained variance ratio for the first two components: 0.8097539533017521
Explained variance ratio for the first two components: 0.7466004257843629
Explained variance ratio for the first two components: 0.7892480107092928
Out[380]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [381]:
# visualize_3d_clusters(X_s_train, cluster_labels)

Case: eps=0.5 min_samples=50

In [382]:
# Create clusters
eps = 0.5
min_samples = 50
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.8806149401209554
Explained variance ratio for the first two components: 0.7571566093747192
Explained variance ratio for the first two components: 0.8045191731839905
Out[382]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [383]:
# visualize_3d_clusters(X_s_train, cluster_labels)

Case: eps=0.5 min_samples=100

In [384]:
# Create clusters
eps = 0.5
min_samples = 100
dbs = DBSCAN(eps=eps, min_samples=min_samples)
dbs.fit(X_train_t)
cluster_labels = dbs.labels_

# Show the results
vis.pca_visualize_clusters(X_train_t, cluster_labels)

plt.figure()
sns.heatmap(X_train_t.groupby(cluster_labels).mean())

plt.figure()
X_train_t.groupby(cluster_labels).mean().plot(kind='barh')

plt.figure()
X_train_t.groupby(cluster_labels).count().age.sort_values().plot(kind='barh')
plt.title('Number of members per cluster')
Explained variance ratio for the first two components: 0.851726574873007
Explained variance ratio for the first two components: 0.7803964340791891
Out[384]:
Text(0.5, 1.0, 'Number of members per cluster')
<Figure size 1440x720 with 0 Axes>
In [385]:
# visualize_3d_clusters(X_s_train, cluster_labels)

This setting seems to separate a dense core from the periphery, which could be useful.

I will keep the cases (eps, min_samples) = (0.2, 20) and (eps, min_samples) = (0.5, 100), which seem to produce interesting clusters.
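Turning the retained labelings into columns can be sketched as follows. This is a simplified illustration on synthetic 2D data (the real pipeline uses `create_cluster_feats_3d` on the scaled 3D features); the column names mirror the notebook's convention.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled training data
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5, random_state=0)
df = pd.DataFrame(X, columns=['f1', 'f2'])

# One new categorical column per retained (eps, min_samples) case
cases = {'3d_dbscan_02_20': (0.2, 20), '3d_dbscan_05_100': (0.5, 100)}
for col, (eps, min_samples) in cases.items():
    # Noise points keep the label -1, so downstream models see them as their own category
    df[col] = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
```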

5.B.f Analysis and Saving

Top

The function to save the dataset with the new features is "create_cluster_feats_3d". Its docstring can be seen below.

In [401]:
print(clust.create_cluster_feats_3d.__doc__)
    Adds the features created by clustering for the selected 3D cases (age, income, member_since_epoch).
    The features to add are: 3d_kmeans_3, 3d_ward_3, 3d_ward_19, 3d_gmm_3, 3d_gmm_16, 3d_dbscan_02_20, 3d_dbscan_05_100
    Args:
        static_dataset_path(str): The path to the static dataset to be taken as the initial data.
        output_path(str): The path to save the new dataset.
        save(boolean): Whether to save the new static dataset.
    Returns:
        static_cluster3d_dataset(dataframe): The same as the static dataset but with the features added into new
            columns.
        X_train_r(dataframe): X_train (as obtained from time-split with the input static data) with the new features.
        X_test_r(dataframe): X_test (as obtained from time-split with the input static data) with the new features.
        y_train(pd.Series): y_train as obtained from time-split with the input static data.
        y_test(pd.Series): y_test as obtained from time-split with the input static data.
    
In [396]:
%time static_cluster3d_dataset, X_train_r, X_test_r, y_train, y_test =\
    clust.create_cluster_feats_3d(save=True)
/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/preprocessing/data.py:625: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

/home/miguel/github_repos/starbucks-advertising/src/features/clustering.py:242: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

/home/miguel/github_repos/starbucks-advertising/src/features/clustering.py:245: DataConversionWarning:

Data with input dtype int64, float64 were all converted to float64 by StandardScaler.

CPU times: user 12.8 s, sys: 730 ms, total: 13.5 s
Wall time: 13.5 s
In [397]:
X_train_r.head()
Out[397]:
person time offer_id age became_member_on gender income missing_demographics member_epoch_days difficulty ... ward_12 dbscan_10 3d_kmeans_3 3d_ward_3 3d_ward_9 3d_ward_19 3d_gmm_3 3d_gmm_16 3d_dbscan_02_20 3d_dbscan_05_100
0 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... 8.0 1.0 0.0 1.0 3.0 4.0 2.0 11.0 -1.0 0.0
1 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... 8.0 1.0 0.0 1.0 3.0 4.0 2.0 11.0 -1.0 0.0
2 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 33.0 2017-04-21 M 72000.0 0 17277 5.0 ... 8.0 1.0 0.0 1.0 3.0 4.0 2.0 11.0 -1.0 0.0
3 00116118485d4dfda04fdbaba9a87b5c 168 f19421c1d4aa40978ebb69ca19b0e20d NaN 2018-04-25 None NaN 1 17646 5.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0011e0d4e6b944f998e987f904e8c1e5 0 3f207df678b143eea3cee63160fa8bed 40.0 2018-01-09 O 57000.0 0 17540 0.0 ... 9.0 -1.0 0.0 1.0 4.0 8.0 2.0 11.0 0.0 0.0

5 rows × 29 columns

In [398]:
cluster_feats = ['3d_kmeans_3', '3d_ward_3', '3d_ward_9', '3d_ward_19',
                 '3d_gmm_3', '3d_gmm_16', '3d_dbscan_02_20', '3d_dbscan_05_100']
sns.pairplot(X_train_r[cluster_feats])
/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/numpy/lib/histograms.py:754: RuntimeWarning:

invalid value encountered in greater_equal

/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/numpy/lib/histograms.py:755: RuntimeWarning:

invalid value encountered in less_equal

Out[398]:
<seaborn.axisgrid.PairGrid at 0x7fbffe03cb00>

6 Lagged Features

Top

The idea of lagged features is to use past values of target-related quantities as predictor features. Because the simulated period is very short, not many lagged features are possible, but some can still be calculated.
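The core bookkeeping can be sketched as an expanding, per-user count of past offers and past successes. This is a simplified stand-in for `fill_lagged_success` on a toy frame; the column names `offer_n`, `offer_success`, and `success_ratio` are illustrative.

```python
import pandas as pd

# Toy offer history; 'success' is the target whose past values we lag
events = pd.DataFrame({
    'person':   ['a', 'a', 'a', 'b', 'b'],
    'time':     [0, 168, 336, 0, 168],
    'offer_id': ['x', 'x', 'y', 'x', 'x'],
    'success':  [1, 0, 1, 0, 1],
}).sort_values(['person', 'time'])

g = events.groupby(['person', 'offer_id'])
events['offer_n'] = g.cumcount()  # times this offer was shown strictly before now
# cumsum minus the current row counts only strictly earlier successes (no target leakage)
events['offer_success'] = g['success'].cumsum() - events['success']
events['success_ratio'] = (events['offer_success'] / events['offer_n']).fillna(0)
```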

In [176]:
%reset -f
In [177]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_cluster3d.pkl')

import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.features.lagged as lag
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [178]:
data = pd.read_pickle(STATIC_DATASET_PATH)
print(data.shape)
data.head()
(76277, 39)
Out[178]:
person time offer_id age became_member_on gender income missing_demographics member_epoch_days difficulty ... ward_12 dbscan_10 3d_kmeans_3 3d_ward_3 3d_ward_9 3d_ward_19 3d_gmm_3 3d_gmm_16 3d_dbscan_02_20 3d_dbscan_05_100
0 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... 8.0 1.0 0.0 1.0 3.0 4.0 0.0 4.0 -1.0 0.0
1 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... 8.0 1.0 0.0 1.0 3.0 4.0 0.0 4.0 -1.0 0.0
2 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 33.0 2017-04-21 M 72000.0 0 17277 5.0 ... 8.0 1.0 0.0 1.0 3.0 4.0 0.0 4.0 -1.0 0.0
3 0009655768c64bdeb2e877511632db8f 504 fafdcd668e3743c1bb461111dcafc2a4 33.0 2017-04-21 M 72000.0 0 17277 10.0 ... 8.0 1.0 0.0 1.0 3.0 4.0 0.0 4.0 -1.0 0.0
4 0009655768c64bdeb2e877511632db8f 576 2906b810c7d4411798c6938adc9daaa5 33.0 2017-04-21 M 72000.0 0 17277 10.0 ... 8.0 1.0 0.0 1.0 3.0 4.0 0.0 4.0 -1.0 0.0

5 rows × 39 columns

In [179]:
data.columns
Out[179]:
Index(['person', 'time', 'offer_id', 'age', 'became_member_on', 'gender',
       'income', 'missing_demographics', 'member_epoch_days', 'difficulty',
       'duration', 'offer_type', 'reward_t', 'channel_mobile', 'channel_email',
       'channel_web', 'channel_social', 'completed', 'expected_finish',
       'finish', 'success', 'view_time', 'viewed', 'actual_reward',
       'profit_in_duration', 'profit_until_complete', 'spent_in_duration',
       'spent_until_complete', 'kmeans_8', 'ward_12', 'dbscan_10',
       '3d_kmeans_3', '3d_ward_3', '3d_ward_9', '3d_ward_19', '3d_gmm_3',
       '3d_gmm_16', '3d_dbscan_02_20', '3d_dbscan_05_100'],
      dtype='object')

Three functions were created to obtain the desired lagged features. Their docstrings are shown below.

In [180]:
print(lag.fill_one_lagged_success.__doc__)
    For a given time, and a given user, it counts how many times each offer was shown,
    and how many of those were a success (the rate of success could be easily calculated
    afterwards).
    offer_id_n: keeps track of how many times the offer was shown
    offer_id_success: keeps track of how many times the offer was successful
    
In [181]:
print(lag.fill_user_lagged_success.__doc__)
 Fills the lagged success features for all the records in one customer. 
In [182]:
print(lag.fill_lagged_success.__doc__)
 Fills the lagged success features for the entire dataset. 

Let's create the features

In [183]:
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), 
                         orient='records', lines=True)
In [184]:
%time data_lag = lag.fill_lagged_success(data, portfolio)
CPU times: user 9min 32s, sys: 865 ms, total: 9min 32s
Wall time: 9min 32s
In [185]:
print(data_lag.columns)
Index(['person', 'time', 'offer_id', 'age', 'became_member_on', 'gender',
       'income', 'missing_demographics', 'member_epoch_days', 'difficulty',
       ...
       'difficulty_7.0_success_ratio', 'reward_10.0_success_ratio',
       'reward_0.0_success_ratio', 'reward_5.0_success_ratio',
       'reward_3.0_success_ratio', 'reward_2.0_success_ratio',
       'channel_web_success_ratio', 'channel_email_success_ratio',
       'channel_social_success_ratio', 'channel_mobile_success_ratio'],
      dtype='object', length=135)

Let's look at the resulting success ratios

In [186]:
data_lag.filter(regex='.*_ratio').describe()
Out[186]:
ae264e3637204a6fb9bb56bc8210ddfd_success_ratio 4d5c57ea9a6940dd891ad53e9dbe8da0_success_ratio 3f207df678b143eea3cee63160fa8bed_success_ratio 9b98b8c7a33c4b65b9aebfe6a799e6d9_success_ratio 0b1e1539f2cc45b7b9fa7c272da2e1d7_success_ratio 2298d6c36e964ae4a3e7e9706d1fb8c2_success_ratio fafdcd668e3743c1bb461111dcafc2a4_success_ratio 5a8bc65990b245e5a138643cd4eb9837_success_ratio f19421c1d4aa40978ebb69ca19b0e20d_success_ratio 2906b810c7d4411798c6938adc9daaa5_success_ratio ... difficulty_7.0_success_ratio reward_10.0_success_ratio reward_0.0_success_ratio reward_5.0_success_ratio reward_3.0_success_ratio reward_2.0_success_ratio channel_web_success_ratio channel_email_success_ratio channel_social_success_ratio channel_mobile_success_ratio
count 76277.000000 76277.000000 76277.0 76277.000000 76277.000000 76277.000000 76277.000000 76277.0 76277.000000 76277.000000 ... 76277.000000 76277.000000 76277.0 76277.000000 76277.000000 76277.000000 76277.000000 76277.000000 76277.000000 76277.000000
mean 0.060752 0.061554 0.0 0.052214 0.049980 0.097540 0.109875 0.0 0.078607 0.053941 ... 0.097540 0.110455 0.0 0.147644 0.097540 0.148621 0.267862 0.257108 0.256908 0.253037
std 0.236452 0.238293 0.0 0.219411 0.215179 0.294205 0.310737 0.0 0.266397 0.222797 ... 0.294205 0.306856 0.0 0.339739 0.294205 0.347324 0.382153 0.357700 0.394048 0.362347
min 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 ... 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.499998 0.499998 0.499998 0.499998
max 0.999998 0.999997 0.0 0.999997 0.999997 0.999997 0.999997 0.0 0.999997 0.999997 ... 0.999997 0.999998 0.0 0.999998 0.999997 0.999998 0.999998 0.999998 0.999998 0.999998

8 rows × 32 columns

Save the data

In [187]:
data_lag.to_pickle(os.path.join(DATA_PROCESSED, 'static_cluster_lagged.pkl'))

7 The "Profit in 10 days" problem statement

Top

The idea is to try to predict the profit that a customer will generate in the 10 days after an offer is sent to them. The 10-day period was selected for two reasons.

First, it is the longest "duration" of any offer, so choosing a shorter period would significantly disadvantage offers that become more attractive toward the end of their duration (such offers could exist).

Second, the "effect" period of an offer seems to be longer than its duration (as seen in the spending-time chart), which makes sense: some offers would be unreasonable otherwise (BOGO offers, for example, look like they would be unprofitable unless they have a longer-term effect).

One problem with calculating the profit in 10 days is that the production costs are not available. That makes it impossible to obtain real profit figures, so profit was approximated as the customer's total spending in the period minus the reward given (if any).

The other problem is the overlap of effects between consecutive offers, which is studied below. In this first approach it was ignored, in the hope that it would more or less cancel out statistically. The problem could be addressed in several ways, for example by calculating the profit per day over the non-overlapping part of the 10-day period. That, however, would require additional hypotheses, such as "the effect of an offer does not vary much over its duration" (so that the mean over a sub-period is a reasonably accurate estimate of the overall mean). Given the time deadline, that possibility was not explored.

One final issue that complicates things a bit is that the simulation covers a very short period of time. That means we cannot use a time-based validation scheme with many time splits; a test set was held out, but no more than that.

One particularity of this problem is that we have to consider the possibility of not sending any offer at all. That case will be called the "null offer", and is handled explicitly.

Solution description: the idea is to build two machine learning estimators:

  • The first estimator is a Classifier that predicts whether an offer will be seen. It must be able to output the probability of an offer being seen (not just a class label).
  • The second estimator is a Regressor that predicts the expected profit for a customer in the 10 days following a viewed offer, or in the case that no offer was viewed.

The final estimator combines both by multiplying the viewing probabilities from the first by the expected profits from the second.
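The two-estimator combination can be sketched as follows. This is a minimal illustration on synthetic data, using scikit-learn gradient boosting as a stand-in for the XGBoost-based estimators used in the notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
viewed = (X[:, 0] + rng.normal(size=500) > 0).astype(int)           # was the offer seen?
profit = np.clip(X[:, 1] * 5 + 10 + rng.normal(size=500), 0, None)  # 10-day profit

# Classifier: probability that the offer is viewed
clf = GradientBoostingClassifier().fit(X, viewed)
# Regressor: expected profit, trained on the viewed cases only
reg = GradientBoostingRegressor().fit(X[viewed == 1], profit[viewed == 1])

# Expected profit of sending the offer = P(viewed) * E[profit | viewed]
p_view = clf.predict_proba(X)[:, 1]
expected_profit = p_view * reg.predict(X)
```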

In [188]:
%reset -f
In [189]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_cluster_lagged.pkl')


import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.data.profit_10_days_dataset as p10
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [190]:
# Read the data
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)
static_data = pd.read_pickle(STATIC_DATASET_PATH)

# Basic Preprocessing
print('Basic preprocessing')
%time data, portfolio = pp.basic_preprocessing(portfolio, profile, transcript)

# Split
received, viewed, completed, transactions = pp.split_transcript(data)
Basic preprocessing
CPU times: user 1.91 s, sys: 52.1 ms, total: 1.96 s
Wall time: 1.96 s

We have to differentiate the case in which the customer has viewed an offer in the last 10 days from the case in which they haven't.

In [191]:
static_data.viewed.mean()
Out[191]:
0.7502523696527131

So, 75% of the sent offers are seen. The remaining 25% will be used as samples of the "null offer" (more samples could be obtained by looking at all the periods with no viewed offers, but that would complicate things and doesn't seem necessary).

Let's look at the mean spending over time

In [192]:
# Create empty time series
ts = pd.DataFrame(np.zeros((data.time.nunique(), data.person.nunique())), 
                  index=data.time.unique(), columns=data.person.unique())

# Add the transactions data
ts.update(transactions.pivot_table(index='time',
                              columns='person',
                              values='amount',
                              aggfunc='sum'))

sending_times = received.time.unique()
for sent in sending_times:
    plt.vlines(sent, ts.mean(axis=1).min(), ts.mean(axis=1).max(), colors='r')
    plt.text(sent, ts.mean(axis=1).max()*0.8, 'sent!', dict(color='r', size=16))
ts.mean(axis=1).plot()

print('The offers were sent at these times: {}'.format(sending_times))
The offers were sent at these times: [  0 168 336 408 504 576]

How big is the difference in time between consecutive offers?

In [193]:
sending_times[1:] - sending_times[:-1]
Out[193]:
array([168, 168,  72,  96,  72])

It can be seen that the effect of an offer lasts at least about 170 hours, i.e. 7 days, but possibly more. Let's take a 10-day period as the "relevant" period for all offers. If we took a different period for each offer, we would be comparing them unfairly.

Let's look at the overlapping issue

A function was created to show the "relevant" periods of a customer's offers. It can show either only the viewed offers or all of them.

In [194]:
# Get the data for one user alone
user = static_data[static_data.person == static_data.person[1]]

interesting_offers = user.offer_id.tolist() + ['no_offer']
offer_times = p10.get_offers_ts(user, portfolio, data, viewed=False)
In [195]:
offer_times.loc[:, interesting_offers].plot()
plt.title('All the offers\' relevant periods for one user')
Out[195]:
Text(0.5, 1.0, "All the offers' relevant periods for one user")
In [196]:
p10.get_offers_ts(user, portfolio, data, viewed=True).loc[:, interesting_offers].plot()
plt.title('Only the viewed offers\' relevant periods for one user')
Out[196]:
Text(0.5, 1.0, "Only the viewed offers' relevant periods for one user")

It is very clear that overlapping exists and may be relevant (less so if we consider only the viewed offers). The assumption will be that an offer overlaps randomly with the others, so the effect will, to some extent, cancel out statistically.

If, on average, the offers behave similarly over their full period of interest, a model could work on a day-by-day basis and the overlapping zones could be ignored or handled more easily, but that won't be explored here.

Let's compute the money spent in the 10 days following offer reception, without accounting for the overlapping or the views.

A function was implemented to add the amount spent by each customer in the 10 days following the reception of an offer. Running it on the full dataset takes about 25 minutes; if the dataset was already created, it is simply loaded from disk (much faster).
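The core idea behind that aggregation can be sketched with a small pandas example (toy data and simplified logic; the real `p10.get_spent_days_static` works on the full transcript and many more columns): for each offer reception, sum the customer's transaction amounts in the 240 hours that follow.

```python
import pandas as pd

# Hypothetical toy data: a few transactions and one received offer per person.
transactions = pd.DataFrame({
    'person': ['a', 'a', 'a', 'b'],
    'time':   [10, 200, 400, 50],   # hours
    'amount': [5.0, 7.5, 3.0, 2.0],
})
offers = pd.DataFrame({'person': ['a', 'b'], 'time': [0, 0]})

WINDOW = 240  # 10 days in hours

def spent_in_window(row):
    # Sum this person's transactions inside [reception, reception + WINDOW).
    mask = ((transactions.person == row.person) &
            (transactions.time >= row.time) &
            (transactions.time < row.time + WINDOW))
    return transactions.loc[mask, 'amount'].sum()

offers['spent_10_days'] = offers.apply(spent_in_window, axis=1)
print(offers)  # person 'a' spent 12.5 in the window, 'b' spent 2.0
```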

In [197]:
# The function below generates the profit 10 days static dataset
STATIC_SPENT_10_DAYS = os.path.join(DATA_PROCESSED, 'static_spent_10_days.pkl')

if os.path.exists(STATIC_SPENT_10_DAYS):
    filled = pd.read_pickle(STATIC_SPENT_10_DAYS)
else:
    filled = p10.get_spent_days_static(static_data, data)
    filled.to_pickle(STATIC_SPENT_10_DAYS)
    
print(filled.shape)
filled.head()
(76277, 136)
Out[197]:
person time offer_id age became_member_on gender income missing_demographics member_epoch_days difficulty ... reward_10.0_success_ratio reward_0.0_success_ratio reward_5.0_success_ratio reward_3.0_success_ratio reward_2.0_success_ratio channel_web_success_ratio channel_email_success_ratio channel_social_success_ratio channel_mobile_success_ratio spent_10_days
0 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 22.16
1 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 33.0 2017-04-21 M 72000.0 0 17277 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 46.51
2 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 33.0 2017-04-21 M 72000.0 0 17277 5.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 46.51
3 0009655768c64bdeb2e877511632db8f 504 fafdcd668e3743c1bb461111dcafc2a4 33.0 2017-04-21 M 72000.0 0 17277 10.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 96.87
4 0009655768c64bdeb2e877511632db8f 576 2906b810c7d4411798c6938adc9daaa5 33.0 2017-04-21 M 72000.0 0 17277 10.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 58.93

5 rows × 136 columns

How is the spending distributed among the offers?

In [198]:
filled.groupby('offer_id').spent_10_days.sum().sort_values().plot(kind='barh')
Out[198]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbff9f08320>

And what's the mean spending per offer id?

In [199]:
filled.groupby('offer_id').spent_10_days.mean().sort_values().plot(kind='barh')
Out[199]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc015fce748>

What about the mean spending for each kind of offer?

In [200]:
filled.groupby('offer_type').spent_10_days.mean().sort_values().plot(kind='barh')
Out[200]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbffe51f748>

Interestingly, it looks like the "discount" and "bogo" offers tend to produce a bit more income.

Creating the profits dataset

After collecting the information about the money spent by customers, there is still some work to do before we have a usable dataset.

In particular, the "null offer" must be added, and that makes the dataset a bit tricky. The "null offer" must be present when predicting the profits for a viewed offer, but not when predicting the probability of a view happening (the null offer is filled in when offers were not viewed, and that is precisely what we want to determine in that case). To address this, PROFIT_COLS and VIEW_COLS were created, containing differentiated offer information: the VIEW_COLS are a copy of the offer-information columns, while the PROFIT_COLS are also a copy but sometimes contain the null offer's information instead.
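The column-splitting idea can be illustrated with a toy sketch (hypothetical columns and null-offer values; the real `fill_null_offer` handles many more features): the `*_viewcol` copies keep the real offer for the views model, while the profit-side columns get the null offer for unviewed rows.

```python
import pandas as pd

# Toy dataset: two sent offers, one viewed and one not.
df = pd.DataFrame({
    'offer_id': ['bogo_1', 'disc_1'],
    'difficulty': [5.0, 7.0],
    'viewed': [1, 0],
})

offer_cols = ['offer_id', 'difficulty']

# VIEW_COLS: untouched copies of the offer info, for the views predictor.
for col in offer_cols:
    df[col + '_viewcol'] = df[col]

# PROFIT_COLS: where the offer was not viewed, replace the offer info
# with the null offer's values (illustrative values here).
null_values = {'offer_id': 'no_offer', 'difficulty': 0.0}
for col, null_val in null_values.items():
    df.loc[df.viewed == 0, col] = null_val

print(df)  # row 1 now holds 'no_offer' in offer_id but 'disc_1' in offer_id_viewcol
```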

Below are the docstrings of "fill_null_offer" and "get_profit_10_days_data". Also included is "split_view_profit", which takes one dataset (e.g., X_train) and divides it into the parts needed for the "views prediction" and for the "viewed-offer profits prediction".

Finally, an extension of the BasicEncoder was created that accepts additional custom features to label-encode (it was needed for the "offer_id" feature, and for some others that were ultimately not used).

In [201]:
print(p10.fill_null_offer.__doc__)
    Fill the 'null' offer data when an offer was not viewed.
    The 'viewcol' features are generated for the views predictor. That model
    doesn't consider the null offer because it predicts the views themselves.
    Args:
        data(dataframe): A dataframe with sent offers (like the result from 'get_spent_days_static').

    Returns:
        data(dataframe): Like the input but with added / modified columns.
        view_cols(list): The names of the columns for the 'views' estimator.
        profit_cols(list): The names of the columns for the 'profits' estimator.
    
In [202]:
print(p10.get_profit_10_days_data.__doc__)
    Generates the dataset to predict the profits in 10 days for each offer.
    The profits are calculated as the money spent minus the paid reward (if any).
    Args:
        basic_dataset_path(str): The path to the pickle containing the basic
            dataset.
        train_times(list): A list (or tuple) with the time values for the training set.
        test_times(list): A list (or tuple) with the time values for the test set.
        drop_time(boolean): Whether to drop or not the absolute time dependent features.
        anon_person(boolean): Whether to drop or not unique identifiers to customers.
        drop_offer_id(boolean): Whether to drop or not the 'offer_id' feature.
        target(str): The target feature name (typically, 'viewed' or 'profit_10_days', or both).

    Returns:
        X_train(pd.DataFrame): The training dataset.
        X_test(pd.DataFrame): The test dataset.
        y_train(pd.Series): The training target.
        y_test(pd.Series): The test target.
        encoder(BasicEncoderProfits): An encoder to use in an ML pipeline.
        view_cols(list): The names of the columns for the 'views' estimator.
        profit_cols(list): The names of the columns for the 'profits' estimator.
    
In [203]:
print(p10.BasicEncoderProfits.__doc__)
    Transforms the Basic dataset. Adds the possibility of encoding other custom features, like offer_id,
    for example.
    Args:
        custom_features(list): Names of the custom features to label-encode.
    

Below, the datasets are generated and shown. Note that "y" has 2 columns, one for each estimator.

In [204]:
# Get the data
X_train, X_test, y_train, y_test, encoder, view_cols, profit_cols =\
p10.get_profit_10_days_data(fill_null=True, 
                        target=['viewed', 'profit_10_days'], drop_offer_id=False)

print(X_train.shape)
print(y_train.shape)
X_train.head()
(25319, 130)
(25319, 2)
Out[204]:
offer_id age gender income missing_demographics member_epoch_days difficulty duration offer_type reward_t ... channel_mobile_success_ratio difficulty_viewcol duration_viewcol reward_t_viewcol channel_web_viewcol channel_mobile_viewcol channel_email_viewcol channel_social_viewcol offer_id_viewcol offer_type_viewcol
0 5a8bc65990b245e5a138643cd4eb9837 33.0 M 72000.0 0 17277 0.0 3.0 informational 0.0 ... 0.0 0.0 3.0 0.0 0.0 1.0 1.0 1.0 5a8bc65990b245e5a138643cd4eb9837 informational
5 f19421c1d4aa40978ebb69ca19b0e20d NaN None NaN 1 17646 5.0 5.0 bogo 5.0 ... 0.0 5.0 5.0 5.0 1.0 1.0 1.0 1.0 f19421c1d4aa40978ebb69ca19b0e20d bogo
7 3f207df678b143eea3cee63160fa8bed 40.0 O 57000.0 0 17540 0.0 4.0 informational 0.0 ... 0.0 0.0 4.0 0.0 1.0 1.0 1.0 0.0 3f207df678b143eea3cee63160fa8bed informational
8 2298d6c36e964ae4a3e7e9706d1fb8c2 40.0 O 57000.0 0 17540 7.0 7.0 discount 3.0 ... 0.0 7.0 7.0 3.0 1.0 1.0 1.0 1.0 2298d6c36e964ae4a3e7e9706d1fb8c2 discount
12 fafdcd668e3743c1bb461111dcafc2a4 59.0 F 90000.0 0 16864 10.0 10.0 discount 2.0 ... 0.0 10.0 10.0 2.0 1.0 1.0 1.0 1.0 fafdcd668e3743c1bb461111dcafc2a4 discount

5 rows × 130 columns

In [205]:
y_train.head(15)
Out[205]:
viewed profit_10_days
0 1 22.16
5 1 0.70
7 1 13.49
8 1 8.93
12 1 96.33
13 0 0.00
17 1 58.12
21 1 3.99
22 1 14.45
26 1 37.29
27 1 27.75
31 0 64.89
34 1 154.18
39 1 4.39
40 1 9.71
In [206]:
print('View columns: \n{}\n'.format(view_cols))
print('Profit columns: \n{}'.format(profit_cols))
View columns: 
['difficulty_viewcol', 'duration_viewcol', 'reward_t_viewcol', 'channel_web_viewcol', 'channel_mobile_viewcol', 'channel_email_viewcol', 'channel_social_viewcol', 'offer_id_viewcol', 'offer_type_viewcol']

Profit columns: 
['difficulty', 'duration', 'reward_t', 'channel_web', 'channel_mobile', 'channel_email', 'channel_social', 'offer_id', 'offer_type']

Profits Predictor

Then a profits predictor was created that contains 2 sklearn Pipelines: one to predict the view probabilities and another to predict the profits given a viewed offer. Both are combined in the predictor to produce the final result. ProfitsPredictor is a BaseEstimator and RegressorMixin, and follows scikit-learn conventions.

In the "Experiments and Results" section a complete training and evaluation example can be seen.
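One plausible way the two model outputs could be combined (an assumption for illustration; the actual formula lives inside ProfitsPredictor) is a view-probability-weighted mix of the profit predicted for the viewed offer and the profit predicted under the null offer:

```python
def expected_profit(p_view, profit_viewed, profit_null):
    """Combine a view probability and two profit predictions into one
    expected profit (illustrative formula, not the project's exact one)."""
    return p_view * profit_viewed + (1.0 - p_view) * profit_null

# If an offer has a 60% chance of being viewed, yields 30 when viewed
# and 10 when ignored, the expected profit is 0.6*30 + 0.4*10 = 22.0.
print(expected_profit(0.6, 30.0, 10.0))  # 22.0
```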

In [207]:
print(p10.ProfitsPredictor.__doc__)
    Predicts the profits in 10 days for any given offer to a specific customer. It uses two models (sklearn Pipelines):
    a classifier to predict the probability of an offer being viewed, and a regressor to predict the expected profit
    that a customer will generate in the 10 days following the reception of an offer. Both results are combined to give
    the total expected profit in the 10 days after the reception of an offer.
    Args:
        encoder(BasicEncoderProfits): An encoder to use in an ML pipeline.
        view_cols(list): The names of the columns for the 'views' estimator.
        profit_cols(list): The names of the columns for the 'profits' estimator.
    

Best offer choosing functions

After having a predictor for the expected profits in 10 days, some functions are required to decide which offer will be shown to each customer.

To do that, "predict_profit_with_offer" can be used to get predictions for every sample in the dataset, replacing the "offer features" with those of a particular offer.

Then, "choose_offer" loops over all possible offers in the portfolio, including an added "null offer", and picks the best one for each customer in each case (note that the same customer may get different offers at different moments in time because, even though the "absolute time features" were dropped [assuming stationarity], we are considering "lagged" features that change with the customer's history).

Those functions will be seen in action when an AB test "quasi-experiment" is run in the "Experiments and Results" section.
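The selection loop can be sketched with a stub predictor (all names and the toy scoring rule below are illustrative, not the project's actual API): predict the profit for every customer under every offer, then keep the offer with the largest prediction per customer.

```python
import pandas as pd

# Toy portfolio (including a null offer) and toy customers.
portfolio = pd.DataFrame({'offer_id': ['bogo_1', 'disc_1', 'no_offer'],
                          'difficulty': [5.0, 10.0, 0.0]})
customers = pd.DataFrame({'income': [72000.0, 57000.0]})

def predict_profit_stub(customers, offer):
    # Stand-in for 'predict_profit_with_offer' (a toy rule, not a real model).
    return customers['income'] / 10000.0 * (1 + 0.05 * offer['difficulty'])

# One column of predictions per offer, then pick the best offer per customer.
preds = pd.DataFrame({offer.offer_id: predict_profit_stub(customers, offer)
                      for _, offer in portfolio.iterrows()})
best_offer = preds.idxmax(axis=1)
print(best_offer.tolist())  # ['disc_1', 'disc_1']
```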

The docstrings are below.

In [208]:
print(p10.predict_profit_with_offer.__doc__)
    Predicts the profit in 10 days for a given offer.
    Args:
        model(ProfitsPredictor): The model to estimate the profits in 10 days.
        data(dataframe): A static dataset, like the result of 'get_profit_10_days_data' (X_train, X_test, ...).
        offer(pd.Series): One row of the portfolio dataframe.
        drop_offer_id(boolean): Whether to drop or not the 'offer_id' column.

    Returns:
        predictions(pd.Series): The predicted profits for the offer and for each sample in 'data'.
    
In [209]:
print(p10.choose_offer.__doc__)
    Given a model and a features dataframe it returns the offers that maximize the model predictions.
    It calls 'predict_profit_with_offer' for each offer in portfolio, and selects the one with the largest
    predicted profit.
    Args:
        model(ProfitsPredictor): The model to estimate the profits in 10 days.
        X(dataframe): A static dataset, like the result of 'get_profit_10_days_data' (X_train, X_test, ...).
        portfolio(dataframe): The processed portfolio dataframe. Like the result from 'pp.basic_preprocessing'.
        add_null_offer(boolean): Whether to add the null offer (no offer at all) to the portfolio.

    Returns:
        pd.Series: A series with the offer_id of the selected (best) offer for each sample of X.
    

8 Experiments and Results

Top

8.A Offer Success experiments

Top

Many experiments were run with different features and hyperparameters; below, the most successful one is reproduced. The "Grid Search" that led to this result took about 37 minutes, so a boolean (grid_search) is provided to either run it in full or skip straight to the best parameters found.

In [210]:
%reset -f
In [211]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.visualization.visualize as vis
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Get the data and show it

In [212]:
# Get the data
STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_cluster_lagged.pkl')

X_train_val, X_test, y_train_val, y_test, encoder = sd.get_success_data(
    basic_dataset_path=STATIC_DATASET_PATH,
    drop_time=False)

# Time-split validation datasets
X_test = pp.drop_time_dependent(X_test)
X_train, X_val, y_train, y_val = sd.time_split(X_train_val, 
                                               y_train_val,
                                               time_limit=370)
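The time-based split used above can be sketched as follows (a simplified stand-in for `sd.time_split`; the convention that rows with time at or below the limit go to training is an assumption for illustration):

```python
import pandas as pd

# Toy features and targets indexed by the offer-reception time (in hours).
X = pd.DataFrame({'time': [168, 336, 408, 504], 'feat': [1, 2, 3, 4]})
y = pd.Series([0, 1, 1, 0])

time_limit = 370  # everything up to this time is training data

train_mask = X.time <= time_limit
X_train, X_val = X[train_mask], X[~train_mask]
y_train, y_val = y[train_mask], y[~train_mask]
print(len(X_train), len(X_val))  # 2 2
```

Splitting by time (rather than randomly) keeps the validation set strictly in the "future" of the training set, which avoids leaking information about later customer behavior into the model.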
In [213]:
print(X_train.shape)
print(y_train.shape)
X_train.head()
(38030, 120)
(38030,)
Out[213]:
age gender income missing_demographics member_epoch_days difficulty duration offer_type reward_t channel_mobile ... difficulty_7.0_success_ratio reward_10.0_success_ratio reward_0.0_success_ratio reward_5.0_success_ratio reward_3.0_success_ratio reward_2.0_success_ratio channel_web_success_ratio channel_email_success_ratio channel_social_success_ratio channel_mobile_success_ratio
0 33.0 M 72000.0 0 17277 0.0 3.0 informational 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 33.0 M 72000.0 0 17277 0.0 4.0 informational 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 NaN None NaN 1 17646 5.0 5.0 bogo 5.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 40.0 O 57000.0 0 17540 0.0 4.0 informational 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 40.0 O 57000.0 0 17540 7.0 7.0 discount 3.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 120 columns

In [214]:
print(X_val.shape)
print(y_val.shape)
X_val.head()
(12778, 120)
(12778,)
Out[214]:
age gender income missing_demographics member_epoch_days difficulty duration offer_type reward_t channel_mobile ... difficulty_7.0_success_ratio reward_10.0_success_ratio reward_0.0_success_ratio reward_5.0_success_ratio reward_3.0_success_ratio reward_2.0_success_ratio channel_web_success_ratio channel_email_success_ratio channel_social_success_ratio channel_mobile_success_ratio
2 33.0 M 72000.0 0 17277 5.0 5.0 bogo 5.0 1.0 ... 0.00000 0.0 0.0 0.00000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
10 40.0 O 57000.0 0 17540 20.0 10.0 discount 5.0 0.0 ... 0.99999 0.0 0.0 0.00000 0.99999 0.000000 0.499998 0.333332 0.499998 0.333332
15 59.0 F 90000.0 0 16864 10.0 5.0 bogo 10.0 1.0 ... 0.00000 0.0 0.0 0.00000 0.00000 0.999995 0.999995 0.666664 0.666664 0.666664
19 24.0 F 60000.0 0 17116 0.0 3.0 informational 0.0 1.0 ... 0.99999 0.0 0.0 0.99999 0.99999 0.000000 0.999995 0.999995 0.999995 0.999995
24 26.0 F 73000.0 0 17338 10.0 10.0 discount 2.0 1.0 ... 0.00000 0.0 0.0 0.00000 0.00000 0.999990 0.499998 0.333332 0.499998 0.333332

5 rows × 120 columns

In [215]:
print(X_test.shape)
print(y_test.shape)
X_test.head()
(25469, 120)
(25469,)
Out[215]:
age gender income missing_demographics member_epoch_days difficulty duration offer_type reward_t channel_mobile ... difficulty_7.0_success_ratio reward_10.0_success_ratio reward_0.0_success_ratio reward_5.0_success_ratio reward_3.0_success_ratio reward_2.0_success_ratio channel_web_success_ratio channel_email_success_ratio channel_social_success_ratio channel_mobile_success_ratio
3 33.0 M 72000.0 0 17277 10.0 10.0 discount 2.0 1.0 ... 0.00000 0.000000 0.0 0.00000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
4 33.0 M 72000.0 0 17277 10.0 7.0 discount 2.0 1.0 ... 0.00000 0.000000 0.0 0.00000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
6 NaN None NaN 1 17646 5.0 5.0 bogo 5.0 1.0 ... 0.00000 0.000000 0.0 0.00000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
11 40.0 O 57000.0 0 17540 5.0 7.0 bogo 5.0 1.0 ... 0.99999 0.000000 0.0 0.99999 0.99999 0.000000 0.666664 0.499999 0.499998 0.333332
16 59.0 F 90000.0 0 16864 0.0 3.0 informational 0.0 1.0 ... 0.00000 0.499998 0.0 0.00000 0.00000 0.999995 0.999997 0.749998 0.749998 0.749998

5 rows × 120 columns

Create the model

In [216]:
cluster1_feats = ['kmeans_8', 'ward_12', 'dbscan_10']
cluster3d_feats = ['3d_kmeans_3', '3d_ward_3', '3d_ward_9', '3d_ward_19',
                 '3d_gmm_3', '3d_gmm_16', '3d_dbscan_02_20', '3d_dbscan_05_100']

base_model = Pipeline([
    ('encoder', pp.BasicEncoder()),
    ('imputer', md.BasicImputer(fill_mode=cluster1_feats + cluster3d_feats)),
    ('estimator', XGBClassifier(max_depth=7, n_estimators=200, n_thread=0,
                                random_state=2018))
])

Toggle grid search

Top

In [217]:
# Grid search for better parameters (to do so, set "grid_search" to True;
# it can take around 40 minutes).
grid_search = True

if grid_search:
    parameters = {
        'estimator__max_depth': [4, 7],
        'estimator__n_estimators': [10, 200, 500],
        'estimator__subsample': [0.5, 1.0],
        'estimator__colsample_bytree': [0.5, 0.7, 1.0],
        'estimator__colsample_bylevel': [0.5, 0.7, 1.0]
    }
    cv = GridSearchCV(base_model, parameters, cv=3, n_jobs=-1)

    %time cv.fit(X_train, y_train)

    print('The best parameters are:')
    print(cv.best_params_)
    print('-'*100)

    model = cv.best_estimator_
    model.get_params()
else:
    model = Pipeline([
        ('encoder', pp.BasicEncoder()),
        ('imputer', md.BasicImputer(fill_mode=cluster1_feats + cluster3d_feats)),
        ('estimator', XGBClassifier(max_depth=4, n_estimators=200, 
                                    subsample=1.0, colsample_bytree=0.5,
                                    colsample_bylevel=1.0, random_state=2018))
    ])
/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/sklearn/externals/joblib/externals/loky/process_executor.py:706: UserWarning:

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.


CPU times: user 18.4 s, sys: 1.05 s, total: 19.5 s
Wall time: 37min 38s
The best parameters are:
{'estimator__colsample_bylevel': 1.0, 'estimator__colsample_bytree': 0.7, 'estimator__max_depth': 4, 'estimator__n_estimators': 200, 'estimator__subsample': 1.0}
----------------------------------------------------------------------------------------------------

Evaluate the model

Time-split Validation

In [218]:
trained_model, y_train_pred, y_val_pred = evos.time_split_validation(model, 
                                                                     basic_dataset_path=STATIC_DATASET_PATH)
Training time: 14.29697847366333 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[15423  4709]
 [ 5104 12794]]
Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.77      0.76     20132
           1       0.73      0.71      0.72     17898

   micro avg       0.74      0.74      0.74     38030
   macro avg       0.74      0.74      0.74     38030
weighted avg       0.74      0.74      0.74     38030

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[4576 2167]
 [1505 4530]]
Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.68      0.71      6743
           1       0.68      0.75      0.71      6035

   micro avg       0.71      0.71      0.71     12778
   macro avg       0.71      0.71      0.71     12778
weighted avg       0.72      0.71      0.71     12778

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.7115928369462771 |
---------------------------------------------------
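As a sanity check, the reported main metric can be recomputed from the validation confusion matrix above (for the positive class: TP = 4530, FP = 2167, FN = 1505), using the standard identity f1 = 2*TP / (2*TP + FP + FN):

```python
# Recompute the positive-class f1-score from the confusion matrix counts.
tp, fp, fn = 4530, 2167, 1505
f1 = 2 * tp / (2 * tp + fp + fn)
print(f1)  # ~0.71159, matching the reported 0.7115928369462771
```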

Customer-split validation

In [219]:
evos.random_1fold_cust_validation(model, basic_dataset_path=STATIC_DATASET_PATH)
Training time: 13.410612344741821 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[14314  4469]
 [ 4731 12047]]
Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.76      0.76     18783
           1       0.73      0.72      0.72     16778

   micro avg       0.74      0.74      0.74     35561
   macro avg       0.74      0.74      0.74     35561
weighted avg       0.74      0.74      0.74     35561

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[6036 2056]
 [2104 5051]]
Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.75      0.74      8092
           1       0.71      0.71      0.71      7155

   micro avg       0.73      0.73      0.73     15247
   macro avg       0.73      0.73      0.73     15247
weighted avg       0.73      0.73      0.73     15247

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.7083158042350302 |
---------------------------------------------------
Training F1-score: 0.7236739352435874

Validation F1-score: 0.7083158042350302

Analysis and Conclusions

In [220]:
vis.show_feat_importances(model, X_train)
<Figure size 1440x720 with 0 Axes>

The cluster features are used by the model, but they are not among the most relevant, and they don't seem to change the results much. Adding the lagged features yields a small improvement in the F1-score.

Test Results (only run this once, after adjusting all the hyperparameters)

In [221]:
evos.offer_success_test(model, basic_dataset_path=STATIC_DATASET_PATH)
Training time: 19.149096250534058 seconds.
--------------------------------------------TRAIN RESULTS--------------------------------------------
Confusion Matrix:
[[20455  6420]
 [ 6900 17033]]
Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.76      0.75     26875
           1       0.73      0.71      0.72     23933

   micro avg       0.74      0.74      0.74     50808
   macro avg       0.74      0.74      0.74     50808
weighted avg       0.74      0.74      0.74     50808

----------------------------------------------------------------------------------------------------
--------------------------------------------TEST RESULTS--------------------------------------------
Confusion Matrix:
[[10295  3693]
 [ 3232  8249]]
Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.74      0.75     13988
           1       0.69      0.72      0.70     11481

   micro avg       0.73      0.73      0.73     25469
   macro avg       0.73      0.73      0.73     25469
weighted avg       0.73      0.73      0.73     25469

----------------------------------------------------------------------------------------------------

___________________________________________________
| MAIN METRIC (test f1-score): 0.7043504247961406 |
---------------------------------------------------
Training F1-score: 0.7189043177309754

Test F1-score: 0.7043504247961406

8.B Profits in 10 Days experiments

Top

In [222]:
%reset -f
In [223]:
import pandas as pd
import numpy as np
import math
import json
import os
import matplotlib.pyplot as plt
%matplotlib inline

%load_ext autoreload
%autoreload 2

from jupyterthemes import jtplot
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

import data_utils_mt.utils as utils
import seaborn as sns
from sklearn.pipeline import Pipeline
import datetime as dt
from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import mean_squared_error as mse
import scipy.stats as stats

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')
DATA_PROCESSED = os.path.join(DATA_DIR, 'processed')
SRC = os.path.join(ROOT_DIR, 'src')

STATIC_DATASET_PATH = os.path.join(DATA_PROCESSED, 'static_spent_10_days.pkl')

import sys
sys.path.append(SRC)

import src.data.preprocessing as pp
import src.data.success_dataset as sd
import src.data.missing_data as md
import src.evaluation.offer_success as evos
import src.data.profit_10_days_dataset as p10
import src.visualization.visualize as vis
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Get the data and show it

In [224]:
# Get the data
X_train, X_test, y_train, y_test, encoder, view_cols, profit_cols =\
p10.get_profit_10_days_data(fill_null=True, 
                        target=['viewed', 'profit_10_days'], drop_offer_id=False)
In [225]:
print(X_train.shape)
print(y_train.shape)
X_train.head()
(25319, 130)
(25319, 2)
Out[225]:
offer_id age gender income missing_demographics member_epoch_days difficulty duration offer_type reward_t ... channel_mobile_success_ratio difficulty_viewcol duration_viewcol reward_t_viewcol channel_web_viewcol channel_mobile_viewcol channel_email_viewcol channel_social_viewcol offer_id_viewcol offer_type_viewcol
0 5a8bc65990b245e5a138643cd4eb9837 33.0 M 72000.0 0 17277 0.0 3.0 informational 0.0 ... 0.0 0.0 3.0 0.0 0.0 1.0 1.0 1.0 5a8bc65990b245e5a138643cd4eb9837 informational
5 f19421c1d4aa40978ebb69ca19b0e20d NaN None NaN 1 17646 5.0 5.0 bogo 5.0 ... 0.0 5.0 5.0 5.0 1.0 1.0 1.0 1.0 f19421c1d4aa40978ebb69ca19b0e20d bogo
7 3f207df678b143eea3cee63160fa8bed 40.0 O 57000.0 0 17540 0.0 4.0 informational 0.0 ... 0.0 0.0 4.0 0.0 1.0 1.0 1.0 0.0 3f207df678b143eea3cee63160fa8bed informational
8 2298d6c36e964ae4a3e7e9706d1fb8c2 40.0 O 57000.0 0 17540 7.0 7.0 discount 3.0 ... 0.0 7.0 7.0 3.0 1.0 1.0 1.0 1.0 2298d6c36e964ae4a3e7e9706d1fb8c2 discount
12 fafdcd668e3743c1bb461111dcafc2a4 59.0 F 90000.0 0 16864 10.0 10.0 discount 2.0 ... 0.0 10.0 10.0 2.0 1.0 1.0 1.0 1.0 fafdcd668e3743c1bb461111dcafc2a4 discount

5 rows × 130 columns

In [226]:
print(X_test.shape)
print(y_test.shape)
X_test.head()
(12778, 130)
(12778, 2)
Out[226]:
offer_id age gender income missing_demographics member_epoch_days difficulty duration offer_type reward_t ... channel_mobile_success_ratio difficulty_viewcol duration_viewcol reward_t_viewcol channel_web_viewcol channel_mobile_viewcol channel_email_viewcol channel_social_viewcol offer_id_viewcol offer_type_viewcol
2 no_offer 33.0 M 72000.0 0 17277 0.0 0.0 no_offer 0.0 ... 0.000000 5.0 5.0 5.0 1.0 1.0 1.0 1.0 f19421c1d4aa40978ebb69ca19b0e20d bogo
10 0b1e1539f2cc45b7b9fa7c272da2e1d7 40.0 O 57000.0 0 17540 20.0 10.0 discount 5.0 ... 0.333332 20.0 10.0 5.0 1.0 0.0 1.0 0.0 0b1e1539f2cc45b7b9fa7c272da2e1d7 discount
15 4d5c57ea9a6940dd891ad53e9dbe8da0 59.0 F 90000.0 0 16864 10.0 5.0 bogo 10.0 ... 0.666664 10.0 5.0 10.0 1.0 1.0 1.0 1.0 4d5c57ea9a6940dd891ad53e9dbe8da0 bogo
19 5a8bc65990b245e5a138643cd4eb9837 24.0 F 60000.0 0 17116 0.0 3.0 informational 0.0 ... 0.999995 0.0 3.0 0.0 0.0 1.0 1.0 1.0 5a8bc65990b245e5a138643cd4eb9837 informational
24 fafdcd668e3743c1bb461111dcafc2a4 26.0 F 73000.0 0 17338 10.0 10.0 discount 2.0 ... 0.333332 10.0 10.0 2.0 1.0 1.0 1.0 1.0 fafdcd668e3743c1bb461111dcafc2a4 discount

5 rows × 130 columns

Create the model

In [227]:
model = p10.ProfitsPredictor(encoder=encoder, view_cols=view_cols, profit_cols=profit_cols)

Evaluate the model

Due to the small number of time instances available, there will be no validation set or grid search. Instead, a single model is fitted and then evaluated on a held-out test set.

In [228]:
%time model.fit(X_train, y_train)
CPU times: user 36.6 s, sys: 81.5 ms, total: 36.7 s
Wall time: 36.6 s
Out[228]:
ProfitsPredictor(encoder=BasicEncoderProfits(custom_features=None),
         profit_cols=['difficulty', 'duration', 'reward_t', 'channel_web', 'channel_mobile', 'channel_email', 'channel_social', 'offer_id', 'offer_type'],
         view_cols=['difficulty_viewcol', 'duration_viewcol', 'reward_t_viewcol', 'channel_web_viewcol', 'channel_mobile_viewcol', 'channel_email_viewcol', 'channel_social_viewcol', 'offer_id_viewcol', 'offer_type_viewcol'])

Analysis and Conclusions

In [229]:
n_feats = 20
In [230]:
plot_importance(model.views_model.named_steps['estimator'], max_num_features=n_feats)
Out[230]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbffa144b70>
In [231]:
plot_importance(model.profits_model.named_steps['estimator'], max_num_features=n_feats)
Out[231]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbffd8f7c88>
In [232]:
y_train_pred = model.predict(X_train)
print('Training error (RMSE) = {}'.format(np.sqrt(mse(y_train['profit_10_days'], y_train_pred))))
Training error (RMSE) = 52.261545076704756

Test Results (only run this once, after adjusting all the hyperparameters)

In [233]:
y_test_pred = model.predict(X_test)
print('Test error (RMSE) = {}'.format(np.sqrt(mse(y_test['profit_10_days'], y_test_pred))))
Test error (RMSE) = 67.31771140570076
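For reference, the RMSE reported here is simply the square root of sklearn's mean_squared_error. A minimal self-contained equivalent (illustrative only, not the notebook's pipeline):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([0.0, 0.0], [3.0, 4.0]))  # sqrt((9 + 16) / 2) = 3.5355...
```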
In [234]:
n_bins = 50
plt.subplot(1,2,1)
_ = plt.hist(y_test_pred, bins=n_bins)
plt.title('Predicted test profits')
plt.subplot(1,2,2)
y_test['profit_10_days'].hist(bins=n_bins)
_ = plt.title('Actual test profits')
In [235]:
_ = sns.kdeplot(y_test_pred, shade=True, color="r", label='Predicted test profits')
_ = sns.kdeplot(y_test['profit_10_days'], shade=True, color="b", label='Actual test profits')
_ = plt.xlim((-50, 250))
_ = plt.title('Estimated densities for the predicted and real profits in 10 days')
/home/miguel/anaconda3/envs/sbucks/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning:

Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

Pseudo AB test

Ideally, to further test the model, we would perform an AB test. In a perfect situation we would collect data from reality; in a less idealized world we would at least have access to a simulator into which we could feed offers. In our case, the only thing available is data generated by a simulator in the past, with no possibility of generating new data to our specifications. Even so, it is possible to perform what I call a "pseudo AB test".

The idea is to predict the best offer for each customer, then look for cases in the test set where, by chance, exactly those offers were sent. That will be the sample "with the new predictor". A random sample from the test set, of the same size as the one obtained before, will be the sample "without the predictor". (The two samples may overlap, but that is not a problem; what matters is that the second sample comes from the distribution of the "old offer-sending procedure".)

This method is far from ideal: in particular, the sample size is determined by chance and cannot be controlled. Even so, the significance of the result can be assessed.
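The sampling scheme described above can be sketched as follows, with toy stand-in arrays (the real notebook uses `X_test.offer_id`, `y_test` and the model's chosen offers; here `best_offer` plays the role of the model's preferred offer per customer):

```python
import numpy as np

rng = np.random.default_rng(2018)
offers = np.array(['bogo', 'discount', 'informational', 'no_offer'])

# Toy stand-ins for the real test set and model output (hypothetical data).
sent_offer = rng.choice(offers, size=1000)   # offer actually sent per customer
profit = rng.normal(40, 65, size=1000)       # profit in 10 days per customer
best_offer = rng.choice(offers, size=1000)   # model's preferred offer per customer

# Sample "with the new predictor": rows where the sent offer happens to match.
match = sent_offer == best_offer
new_sample = profit[match]

# Sample "without the predictor": a same-sized random draw from the old procedure.
old_sample = profit[rng.choice(len(profit), size=match.sum(), replace=False)]

print(len(new_sample) == len(old_sample))  # True
```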

In [236]:
# Read the data
portfolio = pd.read_json(os.path.join(DATA_RAW, 'portfolio.json'), orient='records', lines=True)
profile = pd.read_json(os.path.join(DATA_RAW, 'profile.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(DATA_RAW, 'transcript.json'), orient='records', lines=True)

# Basic Preprocessing
print('Basic preprocessing')
%time _, portfolio = pp.basic_preprocessing(portfolio, profile, transcript)

portfolio
Basic preprocessing
CPU times: user 1.42 s, sys: 24.5 ms, total: 1.44 s
Wall time: 1.44 s
Out[236]:
difficulty duration id offer_type reward channel_mobile channel_web channel_email channel_social
0 10 7 ae264e3637204a6fb9bb56bc8210ddfd bogo 10 1 0 1 1
1 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 bogo 10 1 1 1 1
2 0 4 3f207df678b143eea3cee63160fa8bed informational 0 1 1 1 0
3 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 bogo 5 1 1 1 0
4 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 discount 5 0 1 1 0
5 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 discount 3 1 1 1 1
6 10 10 fafdcd668e3743c1bb461111dcafc2a4 discount 2 1 1 1 1
7 0 3 5a8bc65990b245e5a138643cd4eb9837 informational 0 1 0 1 1
8 5 5 f19421c1d4aa40978ebb69ca19b0e20d bogo 5 1 1 1 1
9 10 7 2906b810c7d4411798c6938adc9daaa5 discount 2 1 1 1 0
In [237]:
selected_offers, predicted_full = p10.choose_offer(model, X_test, portfolio)
selected_offers.head()
Out[237]:
0    fafdcd668e3743c1bb461111dcafc2a4
1    fafdcd668e3743c1bb461111dcafc2a4
2    fafdcd668e3743c1bb461111dcafc2a4
3    fafdcd668e3743c1bb461111dcafc2a4
4                            no_offer
dtype: object
In [238]:
num_samples = (X_test.offer_id == selected_offers.values).sum()
print('The old procedure sent the model\'s preferred offer in {:.1f}% of the cases'.format(
    100*(X_test.offer_id == selected_offers.values).mean()))
print('There are {} samples of the selected offers of the new model.'.format(num_samples))

# Old model
np.random.seed(2018)
old_res = y_test.loc[np.random.choice(y_test.index, num_samples, replace=False), :]
old_views = old_res['viewed']
old_profits = old_res['profit_10_days']
print('\n' + '-'*100)
print('Total views for the old model: {} ({:.1f}%)'.format(old_views.sum(), 100*old_views.sum()/num_samples))
print('Total profit for the old model: {}'.format(old_profits.sum()))
print('Mean and std for the old model: mean = {}, std = {}'.format(old_profits.mean(), old_profits.std()))

# New model
new_res = y_test[(X_test.offer_id == selected_offers.values)]
new_views = new_res['viewed']
new_profits = new_res['profit_10_days']
print('-'*100)
print('Total views for the new model: {} ({:.1f}%)'.format(new_views.sum(), 100*new_views.sum()/num_samples))
print('Total profit for the new model: {}'.format(new_profits.sum()))
print('Mean and std for the new model: mean = {}, std = {}'.format(new_profits.mean(), new_profits.std()))
print('-'*100)
The old procedure sent the model's preferred offer in 10.1% of the cases
There are 1290 samples of the selected offers of the new model.

----------------------------------------------------------------------------------------------------
Total views for the old model: 958 (74.3%)
Total profit for the old model: 54845.61
Mean and std for the old model: mean = 42.51597674418603, std = 64.98748870948882
----------------------------------------------------------------------------------------------------
Total views for the new model: 1063 (82.4%)
Total profit for the new model: 61957.53
Mean and std for the new model: mean = 48.02909302325582, std = 82.60714117173515
----------------------------------------------------------------------------------------------------

Let's check the significance of the test

In [239]:
pooled_profits = np.hstack([old_profits.values, new_profits.values])
mu_pooled = pooled_profits.mean()
sigma_pooled = pooled_profits.std()
mu_old = old_profits.mean()
mu_new = new_profits.mean()

# Note: this divides the mean difference by the pooled standard deviation,
# not by the standard error of the difference in means, so it is very conservative.
z = (mu_new - mu_old) / sigma_pooled

print('z = {}'.format(z))
print('The p-value is: {}'.format(1 - stats.norm.cdf(z)))
z = 0.07415722393428752
The p-value is: 0.47044264122179047

So, at a significance level of 0.05, it is not possible to reject the null hypothesis. That may simply mean we do not have enough data to decide. A properly designed experiment, with access to real or simulated data collection, might be able to reject it: a single experiment (no more than that, to avoid the mistake of "trying until a good sample comes along"), designed from the beginning with the desired significance level and statistical power.
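Note that the statistic computed above divides the mean difference by the pooled standard deviation, while the conventional two-sample z statistic divides by the standard error of the difference in means. A stand-alone sketch of that conventional form (using `math.erfc` in place of `stats.norm.cdf` to stay dependency-free):

```python
import math

def two_sample_z(a, b):
    """Conventional two-sample z: mean difference over the standard error of the difference."""
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)  # sample variance
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    se = math.sqrt(var_a / n_a + var_b / n_b)  # standard error of the mean difference
    z = (mean_b - mean_a) / se
    p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided p-value, 1 - Phi(z)
    return z, p

z, p = two_sample_z([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
print(round(z, 3))  # 1.095
```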

In [240]:
_ = sns.kdeplot(old_profits, shade=True, color="r", label='old')
_ = sns.kdeplot(new_profits, shade=True, color="b", label='new')
_ = plt.xlim((-50, 250))
_ = plt.title('Estimated densities for the old and new profits in 10 days')

In [ ]: